grst / single_cell_data_integration


gene identifier remapping #1

Closed grst closed 5 years ago

grst commented 5 years ago

The consensus seems to be that ENSEMBL is the best. So, if possible, I will merge the datasets based on ENSEMBL ids.

For the following datasets, ENSEMBL ids are available:

For the following datasets, only HGNC/Entrez ids are available:

Ideally, we would reprocess the above datasets. But due to time constraints and limited data availability we might not be able to do that.

Proposal: map all identifiers to ENSEMBL before merging

Problem: what to do with one (HGNC) to many (ENSEMBL) mappings?

This is not a negligible issue:

>cut -f 1,2 mart_export.txt | sort -u | cut -f 2 | sort | uniq -c | sort -rn  | grep -vP "^(\s*)1 " | wc -l
2316

2316 ENSEMBL ids map to more than one HGNC symbol.
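The shell pipeline above only counts ambiguity in one direction, and which direction depends on the column order of the export. As a rough sketch (not the code used in this repo), the same check can be run in both directions at once. The function name `count_ambiguous` and the toy ids are made up for illustration, and unlike the shell command this version skips rows where either field is empty.

```python
from collections import defaultdict


def count_ambiguous(pairs):
    """Return (n_left, n_right): how many ids on each side of a two-column
    mapping are paired with more than one distinct id on the other side."""
    left_to_right = defaultdict(set)
    right_to_left = defaultdict(set)
    for left, right in pairs:
        # Assumption: rows with an empty field carry no mapping and are skipped.
        if not left or not right:
            continue
        left_to_right[left].add(right)
        right_to_left[right].add(left)
    n_left = sum(1 for ids in left_to_right.values() if len(ids) > 1)
    n_right = sum(1 for ids in right_to_left.values() if len(ids) > 1)
    return n_left, n_right


# Toy data with placeholder ids (hypothetical, not taken from mart_export.txt).
pairs = [
    ("ENSG_A", "SYM1"),
    ("ENSG_B", "SYM1"),  # SYM1 is shared by two left-hand ids
    ("ENSG_C", "SYM2"),
    ("ENSG_C", "SYM3"),  # ENSG_C is paired with two symbols
    ("ENSG_D", "SYM4"),
    ("ENSG_E", ""),      # no symbol: skipped
]
ambiguous_left, ambiguous_right = count_ambiguous(pairs)
```

To apply this to the real export, the pairs could be read with `csv.reader(handle, delimiter="\t")`, taking the first two columns of each row.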

mart_export.txt

grst commented 5 years ago

@mlist, can we discuss that again tomorrow?

mlist commented 5 years ago

Sure, let's talk after lunch

Gregor Sturm notifications@github.com wrote on Tue, 23 Oct 2018, 10:04:

@mlist https://github.com/mlist, can we discuss that again tomorrow?


grst commented 5 years ago

For now, I'm mapping to ENSEMBL using BioMart. Genes with no mapping or with multiple mappings are excluded, sacrificing ~1500-2000 genes per dataset.
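A minimal sketch of that exclusion rule, for illustration only: the actual implementation is not shown in this thread. The helper names `build_unambiguous_map` and `remap_genes` are hypothetical, the ids are placeholders, and the rule "also drop symbols whose ENSEMBL id is claimed by another symbol" is an assumption about how "multiple mapping" is interpreted.

```python
from collections import defaultdict


def build_unambiguous_map(pairs):
    """Build symbol -> ENSEMBL id, keeping only one-to-one pairs.

    `pairs` is an iterable of (symbol, ensembl_id) tuples, e.g. rows of a
    BioMart export. Assumption: a symbol is ambiguous if it has several ids
    or if its single id is shared with another symbol.
    """
    symbol_to_ids = defaultdict(set)
    id_to_symbols = defaultdict(set)
    for symbol, ensembl_id in pairs:
        if symbol and ensembl_id:  # skip rows with a missing field
            symbol_to_ids[symbol].add(ensembl_id)
            id_to_symbols[ensembl_id].add(symbol)

    mapping = {}
    for symbol, ids in symbol_to_ids.items():
        if len(ids) == 1:
            (ensembl_id,) = ids
            if len(id_to_symbols[ensembl_id]) == 1:
                mapping[symbol] = ensembl_id
    return mapping


def remap_genes(symbols, mapping):
    """Split a dataset's gene symbols into remapped and excluded genes."""
    kept = {s: mapping[s] for s in symbols if s in mapping}
    dropped = [s for s in symbols if s not in mapping]
    return kept, dropped


# Toy data with placeholder ids (hypothetical).
pairs = [
    ("SYM1", "ENSG_A"),
    ("SYM1", "ENSG_B"),  # one symbol, two ids: excluded
    ("SYM2", "ENSG_C"),
    ("SYM3", "ENSG_C"),  # two symbols, one id: both excluded
    ("SYM4", "ENSG_D"),  # one-to-one: kept
]
mapping = build_unambiguous_map(pairs)
kept, dropped = remap_genes(["SYM1", "SYM4", "SYM5"], mapping)
```

Here `dropped` holds both the ambiguous symbol and the one with no mapping at all, which corresponds to the genes sacrificed per dataset.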

For a consistently processed and mapped version of the dataset see #4.