nestauk / old_nesta_daps

[archived]
MIT License

[326] Refactoring and tidying of NiH collection #327

Closed jaklinger closed 4 years ago

jaklinger commented 4 years ago

Addresses (but doesn't close) #326. Closes #51.

This is the first in a string of PRs refactoring the NiH collection and processing pipelines to be non-project-specific.

Tasks for this PR:

Cleaning steps:

You can test this pipeline with:

luigi --module nih_collect_task RootTask

Future PRs will:

  1. Geocode and apply country-tagging as a lookup table
  2. Implement a generic highly scalable deduplication strategy (including very close matches), which can be ported to other datasets. Output is a lookup table.
  3. Write a getter under data-getters for analysis of deduplicated projects
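The deduplication strategy in (2) could output a lookup table mapping every project id to a canonical id. A minimal sketch of that shape (all names hypothetical; the real implementation would need to scale far beyond this quadratic comparison):

```python
from difflib import SequenceMatcher

def normalise(text):
    # Lower-case and collapse whitespace so trivial variants match exactly
    return " ".join(text.lower().split())

def build_dedupe_lookup(projects, threshold=0.95):
    """Map each project id to a canonical id, grouping near-duplicate titles.

    `projects` is a dict of {project_id: title}.
    Returns a lookup table {project_id: canonical_id}.
    """
    lookup = {}
    canonical = []  # list of (canonical_id, normalised_title)
    for pid, title in projects.items():
        norm = normalise(title)
        for cid, ctitle in canonical:
            # "Very close match" = high similarity ratio on normalised titles
            if SequenceMatcher(None, norm, ctitle).ratio() >= threshold:
                lookup[pid] = cid
                break
        else:
            canonical.append((pid, norm))
            lookup[pid] = pid
    return lookup

projects = {
    1: "Study of alpha-synuclein aggregation",
    2: "Study of Alpha-Synuclein  aggregation",  # trivial variant of 1
    3: "Genomics of heart disease",
}
lookup = build_dedupe_lookup(projects)
```

Because the output is a plain lookup table, it can be joined onto any downstream dataset without re-running the matching, which is what makes the approach portable to other datasets.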
jaklinger commented 4 years ago

Note: the "bad Greek character" encoding (already present in the raw data, so nothing to do with us) can only reasonably be fixed by using data science to infer the missing values. The cost-benefit of this is likely quite shaky.

e.g.

The "?" in "?-synuclein" could be 'alpha', 'beta' or 'gamma'. Without applying some kind of likelihood-based metric, we would risk inferring the wrong character.
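One simple likelihood-based metric (a sketch only; function and variable names are assumptions, not part of this PR) would be to count how often each candidate expansion is attested elsewhere in the corpus and pick the most frequent:

```python
import re
from collections import Counter

GREEK_NAMES = ["alpha", "beta", "gamma"]

def infer_greek(term, corpus):
    """Guess the Greek letter hidden behind '?' in e.g. '?-synuclein'.

    Counts how often each candidate expansion appears in `corpus` and
    returns the most frequent one, or None if no candidate is attested.
    """
    stem = term.lstrip("?")  # '-synuclein'
    counts = Counter()
    for name in GREEK_NAMES:
        candidate = name + stem  # 'alpha-synuclein', 'beta-synuclein', ...
        counts[candidate] = len(re.findall(re.escape(candidate), corpus.lower()))
    best, n = counts.most_common(1)[0]
    return best if n > 0 else None

corpus = ("Alpha-synuclein aggregates in Parkinson's disease. "
          "Mutations in alpha-synuclein are rare; beta-synuclein is related.")
```

Even this only picks the most common candidate; when the true letter is the rarer one, the inference is silently wrong, which is exactly the shaky cost-benefit noted above.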

There are other data-sciency cleaning tasks that could be performed, such as joining together terms separated by hyphens when the joined form matches a word in context. This would fix line-break artifacts (e.g. "hell- o world").
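A conservative version of that hyphen-joining step might check the joined form against a vocabulary (here a toy set; in practice it would be built from the corpus itself — all names below are hypothetical):

```python
VOCAB = {"hello", "world", "synuclein"}  # toy vocabulary for illustration

def fix_line_breaks(text, vocab=VOCAB):
    """Join tokens split as 'hell- o' when the joined form is a known word.

    Conservative: only merges a hyphen-final token with its successor when
    the concatenation appears in `vocab`; otherwise the pair is left as-is.
    """
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.endswith("-") and i + 1 < len(tokens):
            joined = tok[:-1] + tokens[i + 1]
            if joined.lower() in vocab:
                out.append(joined)
                i += 2
                continue
        out.append(tok)
        i += 1
    return " ".join(out)
```

Only merging on a vocabulary hit avoids mangling genuine hyphenated compounds ("state- of" stays untouched), at the cost of missing words absent from the vocabulary.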

Neither of the above will be fixed in this PR: they are simply too time-consuming and likely to achieve only OK results at best.