martinszy closed this issue 9 years ago
With this granularity there will be roughly 7K files per directory, which sounds OK, so I'll go with that. In the meantime, if you want to count files per year or per source, you can do something like:
find ./bills -name '*-[source]-[year]' -type f | wc -l
Where [source] is D, S or PE (there are a couple more that I don't remember) and [year] is the full four-digit year you want to count (e.g. 2012). The pattern is quoted so that the shell passes it to find unexpanded. For example:
$ find ./bills -name '*-D-2013' -type f | wc -l
6542
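To repeat that count for several sources or years, the command can be wrapped in a small shell function. This is only a sketch: `count_bills` is a hypothetical helper name, and it assumes file names end in `-SOURCE-YEAR` with no extension, as in the examples above.

```shell
# count_bills DIR SOURCE YEAR
# Print the number of regular files under DIR whose name ends in -SOURCE-YEAR.
# (Hypothetical helper; assumes names like 1234-D-2013 with no extension.)
count_bills() {
    # The pattern is double-quoted so the shell does not expand the glob itself.
    # tr strips the padding that some wc implementations add around the number.
    find "$1" -type f -name "*-$2-$3" | wc -l | tr -d ' '
}
```

For instance, `for src in D S PE; do echo "$src $(count_bills ./bills "$src" 2013)"; done` would print one count per source for 2013.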
@martinszy In the end I dropped this structure. I'll use a single file in order to implement #10, because I need to build an index from the items. Opening and closing files is too expensive, even for offline processes. It should also make the dataset easier to distribute.
This might sound like a purely aesthetic improvement, so you might not want to give it high priority, but I think the bills should be stored in one folder per year. That way it's easier to count the bills for a given year with find and wc.
When I'm processing the files, it will also be easier to tell whether I have processed all the files from the same folder/year. Right now, with that mess of meaningless numbers, I get confused.
Inside each year folder, the files should be split between s and d, and within those I think they can all sit together.
Also, adding .json to the file names will make it easier for GitHub to display the files.
Example:
2012/s/5454-s-2013.json
2012/d/5453-d-2013.json
2010/s/1253-s-2010.json
2010/d/1253-d-2010.json
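A rough sketch of how existing files could be copied into that layout follows. The helper names `target_path` and `migrate_bills` are hypothetical, and the sketch assumes current file names look like `5454-S-2013` (number, source, year, no extension) and that the year folder is taken from the file name itself.

```shell
# target_path NAME
# Map a bill file name such as 5454-S-2013 to year/source/name.json,
# lowercasing the name on the way. (Hypothetical helper.)
target_path() {
    name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    year=${name##*-}   # text after the last dash
    rest=${name%-*}    # name without the trailing -year
    src=${rest##*-}    # text after the last dash of what remains
    printf '%s/%s/%s.json\n' "$year" "$src" "$name"
}

# migrate_bills SRC_DIR DEST_DIR
# Copy every matching file under SRC_DIR into the new layout under DEST_DIR.
# Files are copied rather than moved, so the original tree stays untouched.
migrate_bills() {
    find "$1" -type f -name '*-*-*' | while IFS= read -r f; do
        dest="$2/$(target_path "$(basename "$f")")"
        mkdir -p "$(dirname "$dest")"
        cp "$f" "$dest"
    done
}
```

With this layout, counting one year becomes `find ./2013 -type f | wc -l`, and one source within a year becomes `find ./2013/s -type f | wc -l`.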