nytimes / Fech

Deprecated. Please see https://github.com/dwillis/Fech for a maintained fork.
http://nytimes.github.io/Fech/

Major overhaul of source mapping #75

Open saizai opened 9 years ago

saizai commented 9 years ago

I noticed that a lot of the mappings were:
a) just wrong (e.g. linked to the wrong record, like col a vs col b, or to the wrong version number or line item)
b) missing (e.g. no field to capture some data in a record)
c) duplicated (e.g. multiple fields mapped to the same name)
d) inconsistently named
e) not well segregated (e.g. unescaped commas or newlines inside fields that are themselves comma/newline separated)

So I'm working on a major overhaul of the source mapping, deriving it directly from the "e-filing headers all versions.xlsx" file in eFilingFormats. While I'm at it, I'm having it support versions 1 & 2 as well as deprecated forms.

Since the data import will have to be redone anyway (because of a-c above), I'm being a bit aggressive about making the names consistent and semantic, e.g. total_receipts_ytd instead of col_b_total_receipts. I'm hoping to reduce the total number of canonical field names from the current ~1.2k to something a bit more sane. ;-)

The new version will have a regex-based mapping file, delimited with the unit separator character (US, ASCII 31) and carrying field type/size data, both to make it easier to edit in the future and to make it possible to automatically output a database migration file.
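
To make the idea concrete, here is a minimal sketch of how such a mapping line could be parsed and turned into migration columns. It is not the format of the eventual pull request: the line layout (version regex, row-type regex, then name:type:size entries), the names `Mapping`, `MappingField`, `parse_mapping_line` and `migration_columns`, the sample line, and the Rails-style column output are all assumptions made up for illustration.

```ruby
# Hypothetical mapping-line layout (an assumption, not Fech's actual format):
#   <version regex> US <row-type regex> US <name:type[:size]> US <name:type[:size]> ...
# where US is the ASCII unit separator (code 31).
US = "\x1F"

MappingField = Struct.new(:name, :type, :size)

Mapping = Struct.new(:version_re, :row_re, :fields) do
  # True when this mapping covers the given format version and row type.
  def applies_to?(version, row_type)
    version_re.match?(version) && row_re.match?(row_type)
  end
end

# Split one line on the unit separator and build a Mapping from it.
def parse_mapping_line(line)
  version_src, row_src, *specs = line.chomp.split(US)
  fields = specs.map do |spec|
    name, type, size = spec.split(":")
    MappingField.new(name.to_sym, type.to_sym, size && Integer(size))
  end
  # Anchor both patterns so they must match the whole string.
  Mapping.new(Regexp.new("\\A(?:#{version_src})\\z"),
              Regexp.new("\\A(?:#{row_src})\\z"),
              fields)
end

# Emit Rails-style column definitions (the output style is an assumption).
def migration_columns(mapping)
  mapping.fields.map do |f|
    opts = f.size ? ", limit: #{f.size}" : ""
    "t.#{f.type} :#{f.name}#{opts}"
  end
end

# Invented sample line; only total_receipts_ytd comes from the discussion above.
line = ["8\\.[0-9]+", "F3[ANT]?",
        "filer_committee_id:string:9", "total_receipts_ytd:decimal"].join(US)
mapping = parse_mapping_line(line)
puts migration_columns(mapping)
```

Because the unit separator essentially never appears in ordinary text, the regexes and field specs can contain commas, colons inside patterns aside, without any escaping scheme, which is presumably the appeal over a comma-separated file.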

I'm expecting to be done in about a week and will make a pull request then. Right now it's not in a fully consistent state.

So @dwillis et al, please hold off on working on this part of the code for the moment.

(Also, I'll be publishing an .sql.gz dump of the full import to date.)

dwillis commented 9 years ago

This is a pretty significant undertaking; I appreciate the effort. I do want to say one thing about conventions: using something like ytd all the time isn't correct; in some cases col_b is cycle-to-date, and we want to reflect that where we can. Reducing the list of canonical field names is something I'm very interested in, but want to make sure it doesn't lose anything we actually need.

saizai commented 9 years ago

Agreed re. naming. I intend to distinguish them properly. Thanks for reminding me of the YTD vs CTD difference. (Do you happen to remember what forms use it?)

In any case, it wouldn't risk losing anything: I'm enforcing unique names per version and row type. It might just require a review to ensure the names are apt; if they aren't, renaming a column is very easy with my new scheme.
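
As a rough illustration of that uniqueness rule, a check along these lines would flag any name used twice within the same version and row type. This is a sketch only: `duplicate_field_names`, the `[version, row_type, name]` entry shape, and the sample data (including `total_receipts_period`) are hypothetical, not taken from the actual mapping code.

```ruby
# Each entry is [version, row_type, field_name].
# Returns the entries that occur more than once, i.e. names that are
# not unique within a single version and row type.
def duplicate_field_names(entries)
  entries
    .group_by { |entry| entry }
    .select { |_key, group| group.size > 1 }
    .keys
end

# Invented sample data for illustration.
entries = [
  ["8.1", "F3",  :total_receipts_ytd],
  ["8.1", "F3",  :total_receipts_period],
  ["8.1", "F3",  :total_receipts_ytd],  # clashes with the first entry
  ["8.1", "F3X", :total_receipts_ytd],  # fine: different row type
  ["5.0", "F3",  :total_receipts_ytd],  # fine: different version
]

p duplicate_field_names(entries)
```

The same name may repeat across versions or row types; only a repeat inside one (version, row type) pair counts as a clash.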

dwillis commented 9 years ago

Makes sense. e-Filing formats 5.0-6.3 use cycle-to-date for Form F10, and Form 3 uses it as well for Schedule A.

saizai commented 9 years ago

Got it. Will check.