nestauk / old_nesta_daps

[archived]
MIT License

[326] Refactoring and tidying of NiH collection #327

Closed jaklinger closed 4 years ago

jaklinger commented 4 years ago

Addresses (but doesn't close) #326. Closes #51.

This is the first in a string of PRs refactoring the NiH collection and processing pipelines to be non-project-specific.

Tasks for this PR:

Cleaning steps:

You can test this pipeline with:

luigi --module nih_collect_task RootTask

Future PRs will:

  1. Geocode and apply country-tagging as a lookup table
  2. Implement a generic highly scalable deduplication strategy (including very close matches), which can be ported to other datasets. Output is a lookup table.
  3. Write a getter under data-getters for analysis of deduplicated projects
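The deduplication strategy in (2) could output a lookup table mapping every project id to a canonical id. A minimal sketch of that shape (all names hypothetical; the real implementation would need to scale far beyond this quadratic comparison):

```python
from difflib import SequenceMatcher

def normalise(text):
    # Lower-case and collapse whitespace so trivial variants match exactly
    return " ".join(text.lower().split())

def build_dedupe_lookup(projects, threshold=0.95):
    """Map each project id to a canonical id, grouping near-duplicate titles.

    `projects` is a dict of {project_id: title}.
    Returns a lookup table {project_id: canonical_id}.
    """
    lookup = {}
    canonical = []  # list of (canonical_id, normalised_title)
    for pid, title in projects.items():
        norm = normalise(title)
        for cid, ctitle in canonical:
            # "Very close match" = high similarity ratio on normalised titles
            if SequenceMatcher(None, norm, ctitle).ratio() >= threshold:
                lookup[pid] = cid
                break
        else:
            canonical.append((pid, norm))
            lookup[pid] = pid
    return lookup

projects = {
    1: "Study of alpha-synuclein aggregation",
    2: "Study of Alpha-Synuclein  aggregation",  # trivial variant of 1
    3: "Genomics of heart disease",
}
lookup = build_dedupe_lookup(projects)
```

Because the output is a plain lookup table, it can be joined onto any downstream dataset without re-running the matching, which is what makes the approach portable to other datasets.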
jaklinger commented 4 years ago

Note: the "bad Greek character" encoding (already present in the raw data, so nothing to do with us) can only reasonably be fixed by using data science to infer the missing values. The cost-benefit of this is likely quite shaky.

e.g.

The "?" in "?-synuclein" could be 'alpha', 'beta' or 'gamma'. Without applying some kind of likelihood-based metric, we would risk inferring the wrong character.
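One simple likelihood-based metric (a sketch only; function and variable names are assumptions, not part of this PR) would be to count how often each candidate expansion is attested elsewhere in the corpus and pick the most frequent:

```python
import re
from collections import Counter

GREEK_NAMES = ["alpha", "beta", "gamma"]

def infer_greek(term, corpus):
    """Guess the Greek letter hidden behind '?' in e.g. '?-synuclein'.

    Counts how often each candidate expansion appears in `corpus` and
    returns the most frequent one, or None if no candidate is attested.
    """
    stem = term.lstrip("?")  # '-synuclein'
    counts = Counter()
    for name in GREEK_NAMES:
        candidate = name + stem  # 'alpha-synuclein', 'beta-synuclein', ...
        counts[candidate] = len(re.findall(re.escape(candidate), corpus.lower()))
    best, n = counts.most_common(1)[0]
    return best if n > 0 else None

corpus = ("Alpha-synuclein aggregates in Parkinson's disease. "
          "Mutations in alpha-synuclein are rare; beta-synuclein is related.")
```

Even this only picks the most common candidate; when the true letter is the rarer one, the inference is silently wrong, which is exactly the shaky cost-benefit noted above.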

There are other data-sciency cleaning tasks that could be performed, such as joining together terms separated by hyphens when the joined form matches a word in context. This would fix line-break artifacts (e.g. "hell- o world").
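A conservative version of that hyphen-joining step might check the joined form against a vocabulary (here a toy set; in practice it would be built from the corpus itself — all names below are hypothetical):

```python
VOCAB = {"hello", "world", "synuclein"}  # toy vocabulary for illustration

def fix_line_breaks(text, vocab=VOCAB):
    """Join tokens split as 'hell- o' when the joined form is a known word.

    Conservative: only merges a hyphen-final token with its successor when
    the concatenation appears in `vocab`; otherwise the pair is left as-is.
    """
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.endswith("-") and i + 1 < len(tokens):
            joined = tok[:-1] + tokens[i + 1]
            if joined.lower() in vocab:
                out.append(joined)
                i += 2
                continue
        out.append(tok)
        i += 1
    return " ".join(out)
```

Only merging on a vocabulary hit avoids mangling genuine hyphenated compounds ("state- of" stays untouched), at the cost of missing words absent from the vocabulary.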

Neither of the above will be fixed in this PR: they are simply too time-consuming and likely to achieve only OK results at best.