openstates / issues

Texas bill texts missing characters `A` #1165

Open tuanpham96 opened 7 months ago

tuanpham96 commented 7 months ago

I'm reporting issues I'm seeing on PluralPolicy. If this is not the right place, please let me know where to direct this.

Example URL:

Browser: Both Firefox and Chrome have this issue

Issue: The character `A` is missing in places in the bill text section. It doesn't happen every time, but usually when the character appears between ( ) or inside actual double quotes " ". Turning markup on or off doesn't matter. I haven't checked which other bills or states may have this problem, but I spotted two examples of it.

Below are screenshots comparing what's on the PluralPolicy side with what's in the source PDFs:

image

image

Note: I also double-checked the bulk json on OpenStates and this problem does not seem to appear in the bulk json.

NewAgeAirbender commented 7 months ago

Thanks for raising this. It's an issue with Plural, since Open States doesn't have bill version text available like that. I'm pretty sure it happens because the TX PDFs use white 'A' characters as a substitute for spacing, so we had to try replacing the excess As, and that may have removed some that weren't meant to be removed. I'll take a look and see whether TX still spaces their bills that way and whether we still need that removal logic.
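To illustrate the kind of failure described above, here is a minimal sketch. It is not Plural's actual code. The sample string, both function names, and both regexes are assumptions made up for illustration. It shows how an over-eager rule for stripping filler `A` characters can also strip a real `A` inside ( ) or " ", and how a slightly more conservative rule avoids that case.

```python
import re

# Hypothetical extracted text: the "A" after "SECTION" and the "AA" after "1."
# stand in for filler spacing; the "A" in quotes and in "(A)" is real content.
raw = 'SECTIONA1.AAIn this section, "A" means an agency described by Subsection (A).'

def strip_filler_naive(text):
    # Replace every capital A that is not next to a lowercase letter.
    # This also hits the legitimate "A" inside quotes and parentheses.
    return re.sub(r'(?<![a-z])A(?![a-z])', ' ', text)

def strip_filler_careful(text):
    # Same rule, but leave an A alone when it directly follows ( or "
    # or directly precedes ) or ".
    return re.sub(r'(?<![a-z("])A(?![a-z)"])', ' ', text)

print(strip_filler_naive(raw))
print(strip_filler_careful(raw))
```

Even the more careful rule is only a heuristic: it would still damage all-caps text such as "AN ACT", so any real fix would probably need more context than a single regex.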

tuanpham96 commented 7 months ago

Thanks for the explanation!

May I ask how differently Plural and OpenStates source the text?

When doing text analysis, which one should I rely on more, not only for TX but other states as well?

NewAgeAirbender commented 7 months ago

Open States provides the links to Plural, and Plural does a separate text extraction & processing step that gets the bill text for each version. Open States only processes text for search purposes & doesn't save each bill's version text.
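For text analysis, one option is to take the version links from the Open States data and run your own extraction on the source documents. Below is a minimal sketch of collecting those links. It assumes a bill record shaped like a dict with a `versions` list whose entries carry a `note` and a list of `links`. The field names, the `version_links` helper, and the example.com URL are assumptions for illustration, so check them against the actual bulk JSON.

```python
def version_links(bill):
    """Collect (note, url) pairs for every version link in a bill record."""
    pairs = []
    for version in bill.get("versions", []):
        for link in version.get("links", []):
            pairs.append((version.get("note", ""), link.get("url", "")))
    return pairs

# Hypothetical record; the layout and URL are placeholders, not real data.
bill = {
    "identifier": "HB 1",
    "versions": [
        {
            "note": "Introduced",
            "links": [
                {"url": "https://example.com/hb1-introduced.pdf",
                 "media_type": "application/pdf"},
            ],
        },
    ],
}

for note, url in version_links(bill):
    print(note, url)
```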