propublica / Capitol-Words

Scraping, parsing and indexing the daily Congressional Record to support phrase search over time, and by legislator and date
BSD 3-Clause "New" or "Revised" License

schema.xml missing fields #74

Open AlJohri opened 10 years ago

AlJohri commented 10 years ago

When I symlinked solr/schema.xml to my lucene database and tried to ingest a sample document, it errored out on several missing fields:

   <field name="year" type="string" indexed="true" stored="true" required="true" /> 
   <field name="month" type="string" indexed="true" stored="true" required="true" /> 
   <field name="day" type="string" indexed="true" stored="true" required="true" /> 
   <field name="year_month" type="string" indexed="true" stored="true" required="true" /> 
   <field name="page_id" type="string" indexed="true" stored="true" required="true" /> 
   <field name="slug" type="string" indexed="true" stored="true" required="true" /> 
   <field name="congress" type="string" indexed="true" stored="true" required="true" /> 
   <field name="session" type="string" indexed="true" stored="true" required="true" /> 

Should I open a PR and add them in?

P.S. The Solr version I was using also required me to have this field:

<field name="_version_" type="long" indexed="true" stored="true"/>
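
One way to catch gaps like these before ingesting is to compare the keys of a sample document against the `<field>` declarations in the schema. The sketch below is only an illustration and is not part of Capitol-Words: the helper names, the trimmed-down schema, and the sample document are all hypothetical, and it ignores `<dynamicField>` patterns, which a real schema may use to accept undeclared names.

```python
import xml.etree.ElementTree as ET


def declared_fields(schema_xml):
    """Return the set of names declared by <field> elements in a Solr schema."""
    root = ET.fromstring(schema_xml.strip())
    return {f.get("name") for f in root.iter("field") if f.get("name")}


def missing_fields(schema_xml, doc):
    """Return the document keys that have no <field> declaration, sorted."""
    return sorted(set(doc) - declared_fields(schema_xml))


# Hypothetical, trimmed-down schema and document, for illustration only.
SAMPLE_SCHEMA = """
<schema name="example" version="1.5">
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="date" type="string" indexed="true" stored="true" />
  </fields>
</schema>
"""

SAMPLE_DOC = {"id": "abc", "date": "2014-01-01", "year": "2014", "slug": "x"}

if __name__ == "__main__":
    # Lists the keys that would need a <field> entry added to the schema.
    print(missing_fields(SAMPLE_SCHEMA, SAMPLE_DOC))
```

To check a real setup, the schema text would be read from solr/schema.xml and the document taken from whatever the ingest step is about to send.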

drinks commented 10 years ago

Yeah, those are definitely all needed, a PR would be welcome. Thanks for putting this through its paces, we haven't tried to stand up an instance really since building it.

AlJohri commented 10 years ago

Not a problem. I'm working on setting this up for myself and will see it through until I'm done. What's the easiest way for you guys to look through these changes? Issues/PR for each change? Or one big one?

Just noticed another missing field (I'll add it in, just dropping it here for posterity).

<field name="speaker_middlename" type="string" indexed="true" stored="true" required="false" /> 
<field name="speaker_title" type="string" indexed="true" stored="true" required="false" /> 

drinks commented 10 years ago

Separate PRs would be preferable, but if it's too much trouble to branch each change, that's understandable. Looks like I need, at a minimum, to add another setting for ingest.py to take a solr host. I'll dive into this stuff Tuesday but feel free to keep adding commits as you make them in the meantime. Thanks again!
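
A setting like that could take roughly the following shape. This is a sketch under stated assumptions, not the actual ingest.py interface: the `--solr-host` flag, the `SOLR_HOST` environment variable, the default URL, and the `update_url` helper are all hypothetical names chosen for illustration.

```python
import argparse
import os


def build_parser():
    """Build a command-line parser with a configurable Solr host (hypothetical flag)."""
    parser = argparse.ArgumentParser(
        description="Ingest parsed documents into Solr (sketch)."
    )
    parser.add_argument(
        "--solr-host",
        # Assumed default: a local Solr instance, overridable via SOLR_HOST.
        default=os.environ.get("SOLR_HOST", "http://localhost:8983/solr"),
        help="Base URL of the Solr instance to post documents to.",
    )
    return parser


def update_url(solr_host):
    """Join the base URL and the update handler path without doubling slashes."""
    return solr_host.rstrip("/") + "/update"


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(update_url(args.solr_host))
```

Reading the default from an environment variable keeps the host out of the repository while still letting a one-off run override it on the command line.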

AlJohri commented 10 years ago

Sure! I can split it up. I didn't think it would be this many changes initially, so I just kept committing to master, but I can see how it's quickly becoming unwieldy if you want to reject anything.

I had a couple of questions about getting set up. Do you have a preferred method of contact? GitHub has your email as dan.drinkard@gmail.com.

drinks commented 10 years ago

Yep that email will work just fine.

AlJohri commented 10 years ago

@drinks please see https://github.com/sunlightlabs/Capitol-Words/pull/89