onyxfish / votersdaily

A project to parse the content of diverse government schedules into a consistent format.
GNU General Public License v3.0

All scrapers should set document IDs in the form [datetime] - [parser_name] - [unique key] #58

Open onyxfish opened 15 years ago

onyxfish commented 15 years ago

Here the unique key is whatever is appropriate to a given scraper. For Roll Call Vote scrapers this would be the Roll #. For some scrapers it may be the title, or whatever else makes a given event unique.
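As a rough illustration of the proposed format, here is a minimal Python sketch. The function name `make_document_id`, the parser name `HouseRollCallVotes`, and the ISO 8601 timestamp format are all assumptions made for the example; the thread itself only fixes the three parts and their order.

```python
from datetime import datetime


def make_document_id(event_datetime, parser_name, unique_key):
    """Build an ID of the form '[datetime] - [parser_name] - [unique key]'."""
    # Assumption: an ISO 8601 timestamp. The thread does not specify a format.
    timestamp = event_datetime.strftime("%Y-%m-%dT%H:%M:%S")
    # str() lets numeric keys such as a roll number be passed directly.
    return " - ".join([timestamp, parser_name, str(unique_key)])


# Hypothetical roll call vote: the roll number serves as the unique key.
doc_id = make_document_id(datetime(2009, 7, 14, 10, 30), "HouseRollCallVotes", 542)
print(doc_id)
```

A side benefit of putting the timestamp first is that IDs sort chronologically when compared as strings, provided the timestamp format is fixed-width as in this sketch.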

onyxfish commented 15 years ago

This has now been documented in the Database Planning section of the wiki: http://wiki.github.com/bouvard/votersdaily/database-planning

onyxfish commented 15 years ago

Fixed for Python scrapers. This is definitely a much better way of identifying each document.

chaunceyt commented 15 years ago

Fixed. Closing.

onyxfish commented 15 years ago

It looks like the scrapers are still pulling in branch and entity names, producing IDs in the format [datetime] - [parser_name] - [branch] - [entity] - [unique key]. Now that we are including the parser name, I think we should remove [branch] and [entity]. They only make the IDs longer, and I'm already a bit concerned that some of our URLs are going to be overly long.
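To make the proposed change concrete, the sketch below converts a five-part ID into the shorter three-part form. The helper name `shorten_document_id` and the sample values are hypothetical, and the code assumes that the timestamp, parser name, branch, and entity never contain the " - " separator themselves.

```python
def shorten_document_id(long_id):
    """Drop [branch] and [entity] from a five-part document ID."""
    # Split at most four times so a unique key that itself contains
    # " - " (for example a title) stays in one piece.
    parts = long_id.split(" - ", 4)
    if len(parts) != 5:
        raise ValueError("expected a five-part ID, got: %r" % long_id)
    timestamp, parser_name, _branch, _entity, unique_key = parts
    return " - ".join([timestamp, parser_name, unique_key])


# Hypothetical five-part ID as the scrapers currently produce it.
long_id = "2009-07-14T10:30:00 - HouseRollCallVotes - Legislative - House of Representatives - 542"
short_id = shorten_document_id(long_id)
print(short_id)
```

Note that this helper cannot tell a five-part ID apart from a three-part ID whose unique key happens to contain two separators, so it is only safe to run on IDs known to be in the old format.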

Also, for the Roll Call Votes scrapers where there is a unique Vote Number, I really think we want to use that as the [unique key] portion rather than the title.
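One way to express that preference is sketched below: use the vote number when the scraper has one, and fall back to the title otherwise. The function name `choose_unique_key` and the `vote_number` and `title` dictionary fields are assumptions for illustration, not the project's actual event schema.

```python
def choose_unique_key(event):
    """Pick the [unique key] portion of a document ID for a scraped event."""
    # Hypothetical field names: "vote_number" and "title".
    vote_number = event.get("vote_number")
    if vote_number is not None:
        # Roll call votes: the vote number is short and already unique.
        return str(vote_number)
    # Other scrapers: fall back to the title.
    return event["title"]


roll_call_key = choose_unique_key({"vote_number": 542, "title": "On Passage"})
hearing_key = choose_unique_key({"title": "Oversight Hearing"})
print(roll_call_key, hearing_key)
```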

Going to reopen this ticket, pending discussion.

chaunceyt commented 15 years ago

Will work on this this week.