Improve bill file storage structure

martinszy commented 10 years ago

This might sound like a purely aesthetic improvement, so you might not want to give it a high priority, but I think the bills should be stored in one folder per year. That way it's easier to count the bills in a given year with find and wc.

When I'm processing the files, it will be easier to tell whether I have processed all the files from the same folder/year. Right now, with that mess of meaningless numbers, I get confused.

Inside each year folder, they should be split between s and d, and within each of those I think they should all be kept together.

Also, adding .json to the file names will make it easier for GitHub to display the files.

Example:

2012/s/5454-s-2013.json
2012/d/5453-d-2013.json
2010/s/1253-s-2010.json
2010/d/1253-d-2010.json
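The proposed layout can be sketched as a small shell helper. This is only an illustration: `bill_target_path` is a hypothetical name, and it assumes the current files are named like `1253-D-2010` (number, source, year, no extension) and that the year folder is taken from the year in the file name.

```shell
# Hypothetical helper: map a bill file name such as "1253-D-2010"
# to the proposed path "2010/d/1253-d-2010.json".
# Assumes the name has the form <number>-<source>-<year>.
bill_target_path() {
  name=$1
  year=${name##*-}    # text after the last dash, e.g. 2010
  rest=${name%-*}     # everything before the last dash, e.g. 1253-D
  src=${rest##*-}     # source code, e.g. D
  number=${rest%-*}   # bill number, e.g. 1253
  # The proposal uses lowercase source folders (s, d).
  src=$(printf '%s' "$src" | tr '[:upper:]' '[:lower:]')
  printf '%s/%s/%s-%s-%s.json\n' "$year" "$src" "$number" "$src" "$year"
}
```

For example, `bill_target_path 1253-D-2010` would print the path where that bill would live under the proposed structure.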

seykron commented 10 years ago

With this granularity there will be about 7K files per directory, which sounds OK, so that's how it will be. In the meantime, if you want to count files per year or per source, you can do something like:

find ./bills -name "*-[source]-[year]" -type f | wc -l

Where [source] is either D, S, or PE (there are a couple more I don't remember) and [year] is the full year you want to count (e.g. 2012):

$ find ./bills -name "*-D-2013" -type f | wc -l
6542
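The same count can be wrapped in a small function so the source and year become arguments. This is a sketch, not part of the importer: `count_bills` is a hypothetical name, and it assumes file names end in `-<source>-<year>` as in the command above. The pattern is quoted so the shell passes it to find unexpanded.

```shell
# Hypothetical helper: count bill files for one source and year.
# Usage: count_bills <directory> <source> <year>
count_bills() {
  # Quote the pattern so find, not the shell, interprets the wildcard.
  # tr strips the padding some wc implementations add around the number.
  find "$1" -type f -name "*-$2-$3" | wc -l | tr -d ' '
}
```

Calling `count_bills ./bills D 2013` is then equivalent to the command shown above.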

seykron commented 9 years ago

@martinszy I finally dropped this structure. I'll use a single file in order to implement #10 because I need to build an index from items. Opening/closing files is too expensive, even for offline processes. It will help the dataset distribution as well...

seykron / ogov-importer

Improve bill file storage structure #7