gene identifier remapping

grst commented 6 years ago

The consensus seems to be that ENSEMBL IDs are the best choice of gene identifier. So, if possible, I will merge the datasets based on ENSEMBL IDs.

For the following datasets, ENSEMBL IDs are available:

zheng_bileas_2017
savas_loi_2018
azizi_peer_2018_10x
lambrechts2018{v1,v2,6653} (processed ourselves)

For the following datasets, only HGNC/Entrez IDs are available:

zheng_zhang_2017 (raw data only available upon request)
guo_zhang_2018 (raw data only available upon request)
azizi_peer_2018_indrop (raw data available)

Ideally, we would reprocess the above datasets. But due to time constraints and limited data availability, we might not be able to do that.

Proposal: map all identifiers to ENSEMBL before merging

Problem: what to do with one (HGNC) to many (ENSEMBL) mappings?

This is not a negligible issue:

>cut -f 1,2 mart_export.txt | sort -u | cut -f 2 | sort | uniq -c | sort -rn  | grep -vP "^(\s*)1 " | wc -l
2316

2316 ENSEMBL IDs map to more than one HGNC symbol.
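Which direction the pipeline above measures depends on the column order of the export: it counts values in column 2 that are paired with more than one distinct value in column 1. A rough Python sketch for checking both directions follows. It assumes a tab-separated export with a header row, the Ensembl gene ID in the first column and the HGNC symbol in the second; that column order, the function name `count_ambiguous`, and the toy identifiers are assumptions for illustration, not taken from mart_export.txt. Unlike the shell pipeline, it skips the header and rows with an empty identifier.

```python
import csv
import io
from collections import defaultdict


def count_ambiguous(tsv_text, key_col, value_col):
    """Count distinct keys that are paired with more than one distinct value."""
    mapping = defaultdict(set)
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(reader, None)  # skip the header row (assumed to be present)
    for row in reader:
        if len(row) <= max(key_col, value_col):
            continue  # malformed or truncated line
        key, value = row[key_col], row[value_col]
        if key and value:  # ignore rows with a missing identifier
            mapping[key].add(value)
    return sum(1 for values in mapping.values() if len(values) > 1)


# Made-up toy export; real Ensembl IDs and HGNC symbols look different.
example = (
    "Gene stable ID\tHGNC symbol\n"
    "ENSG_1\tGENE_A\n"
    "ENSG_2\tGENE_A\n"
    "ENSG_3\tGENE_B\n"
    "ENSG_3\tGENE_C\n"
    "ENSG_4\t\n"
)

# Symbols (column 1) paired with more than one Ensembl ID (column 0).
symbols_with_many_ids = count_ambiguous(example, key_col=1, value_col=0)
# Ensembl IDs (column 0) paired with more than one symbol (column 1).
ids_with_many_symbols = count_ambiguous(example, key_col=0, value_col=1)
print(symbols_with_many_ids, ids_with_many_symbols)
```

For the real file, the text could be read with `open("mart_export.txt").read()` and passed in the same way.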

mart_export.txt

grst commented 6 years ago

@mlist, can we discuss that again tomorrow?

mlist commented 6 years ago

Sure, let's talk after lunch

Gregor Sturm notifications@github.com wrote on Tue, 23 Oct 2018, 10:04:

@mlist https://github.com/mlist, can we discuss that again tomorrow?

grst commented 6 years ago

For now, I'm mapping to ENSEMBL using BioMart. Genes that do not map, or that map to multiple identifiers, are excluded, which sacrifices ~1500-2000 genes per dataset.
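As an illustration of that filtering step, here is a minimal sketch, not the actual code used for the datasets. It assumes the BioMart export has already been read into (Ensembl ID, HGNC symbol) pairs, and it treats a mapping as ambiguous in either direction (one symbol to several IDs, or several symbols to one ID); whether both directions were excluded in practice is an assumption. The function names and identifiers are hypothetical.

```python
from collections import defaultdict


def build_unique_map(pairs):
    """Return {symbol: ensembl_id} restricted to strictly one-to-one pairs."""
    by_symbol = defaultdict(set)
    by_ensembl = defaultdict(set)
    for ensembl_id, symbol in pairs:
        if ensembl_id and symbol:  # skip rows with a missing identifier
            by_symbol[symbol].add(ensembl_id)
            by_ensembl[ensembl_id].add(symbol)
    unique = {}
    for symbol, ids in by_symbol.items():
        if len(ids) != 1:
            continue  # one symbol, many Ensembl IDs
        (ensembl_id,) = ids
        if len(by_ensembl[ensembl_id]) == 1:  # no other symbol shares this ID
            unique[symbol] = ensembl_id
    return unique


def remap(symbols, mapping):
    """Split a dataset's symbols into (remapped Ensembl IDs, dropped symbols)."""
    kept = [mapping[s] for s in symbols if s in mapping]
    dropped = [s for s in symbols if s not in mapping]
    return kept, dropped


# Made-up identifiers for illustration only.
pairs = [
    ("ENSG_A1", "GENE_A"),
    ("ENSG_B1", "GENE_B"),
    ("ENSG_B2", "GENE_B"),  # GENE_B maps to two Ensembl IDs
    ("ENSG_C1", "GENE_C"),
    ("ENSG_C1", "GENE_D"),  # two symbols share one Ensembl ID
]
mapping = build_unique_map(pairs)
kept, dropped = remap(["GENE_A", "GENE_B", "GENE_C", "GENE_X"], mapping)
print(kept, dropped)
```

The length of `dropped` is the per-dataset loss, so it can be reported alongside the merged data.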

For a consistently processed and mapped version of the dataset see #4.
