Closed jaklinger closed 4 years ago
Note: fixing the "bad Greek character" encoding (which is already in the raw data, nothing to do with us) can only reasonably be achieved by using data science to infer the values. The cost-benefit of this is likely quite shaky.
e.g.
The "?" in "?-synuclein" could be 'alpha', 'beta' or 'gamma'. Without applying some kind of likelihood-based metric, we would often infer the wrong character.
There are other data sciency data cleaning tasks that could be performed, such as joining together terms separated by hyphens, if the joined term correctly matches the context. This would be to fix line breaks (e.g. "hell- o world").
Neither of the above will be fixed in this PR: it's simply too time-consuming and likely to achieve OK results at best.
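To make the "likelihood-based metric" idea concrete, here is a minimal sketch (not part of this PR). It assumes we have frequency counts of correctly encoded terms from some reference corpus; the function name `infer_greek`, the `min_share` threshold and the counts below are all hypothetical and purely illustrative.

```python
from collections import Counter

# Candidate prefixes named in the example above; a real list would be longer.
GREEK = ("alpha", "beta", "gamma")


def infer_greek(term, known_term_counts, min_share=0.8):
    """Replace a leading '?' with the most frequent Greek prefix.

    Returns None when there is no evidence, or when the best candidate
    accounts for less than `min_share` of the observed matches.
    """
    if not term.startswith("?"):
        return term  # nothing to repair
    suffix = term[1:]
    counts = {g: known_term_counts.get(f"{g}{suffix}", 0) for g in GREEK}
    total = sum(counts.values())
    if total == 0:
        return None  # no evidence at all
    best = max(counts, key=counts.get)
    if counts[best] / total < min_share:
        return None  # too ambiguous to guess
    return f"{best}{suffix}"


# Made-up counts, purely for illustration.
counts = Counter({"alpha-synuclein": 90, "beta-synuclein": 7, "gamma-synuclein": 3})
repaired = infer_greek("?-synuclein", counts)
```

Even in this toy form, the result depends entirely on the reference counts and the threshold, which is why the cost-benefit looks shaky.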
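A rough sketch of the hyphen-joining idea, again not part of this PR. It assumes a known `vocabulary` of valid words is available (hypothetical here) and only joins the two fragments when the joined word is in that vocabulary, as a crude stand-in for "matching context".

```python
import re

# A word fragment, a hyphen, a single space, then another fragment: "hell- o"
BROKEN_WORD = re.compile(r"(\w+)- (\w+)")


def join_broken_words(text, vocabulary):
    """Join 'frag- ment' pairs only when the joined word is a known word."""

    def _join(match):
        joined = match.group(1) + match.group(2)
        # Leave the text untouched if the joined form is not recognised
        return joined if joined.lower() in vocabulary else match.group(0)

    return BROKEN_WORD.sub(_join, text)


vocabulary = {"hello", "world"}  # hypothetical vocabulary
fixed = join_broken_words("hell- o world", vocabulary)
untouched = join_broken_words("pre- and post-treatment", vocabulary)
```

The vocabulary check is what stops legitimate constructions such as "pre- and post-treatment" from being mangled, but it also means the quality of the fix is only as good as the vocabulary.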
Addresses (but doesn't close) #326.
Closes #51.
This is the first in a string of PRs refactoring the NIH collection and processing pipelines so that they are not project-specific.
Tasks for this PR:
Cleaning steps:
- Split `;`-separated terms into an array
- [ ] Check Greek characters (etc.) parse OK (see comment below)
- `[]` --> `null`
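As an illustration of two of the cleaning steps listed above, here is a minimal sketch. It reads them as "split `;`-separated strings into a list" and "map empty lists to null"; that reading, and the helper name `split_terms`, are assumptions rather than the code in this PR.

```python
def split_terms(value, sep=";"):
    """Split a ';'-separated string into a list of stripped terms.

    Empty results are returned as None, so that '[]' becomes null downstream.
    """
    if value is None:
        return None
    terms = [t.strip() for t in value.split(sep) if t.strip()]
    return terms or None


terms = split_terms("neurons; synuclein ;protein folding")
empty = split_terms(" ; ")
```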
You can test this pipeline with:
Future PRs will: