Closed: grst closed this issue 6 years ago
@mlist, can we discuss that again tomorrow?
Sure, let's talk after lunch
Gregor Sturm notifications@github.com wrote on Tue, Oct 23, 2018, 10:04:
> @mlist, can we discuss that again tomorrow?
For now, I'm mapping to Ensembl IDs using BioMart. Genes with no mapping or with multiple mappings are excluded, which sacrifices ~1500-2000 genes per dataset.
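A minimal sketch of this filtering step in Python with pandas, assuming the BioMart result has already been loaded into a data frame. The column names (`hgnc_symbol`, `ensembl_gene_id`), the helper names, and the toy identifiers are all hypothetical. Dropping Ensembl IDs that are shared by several source symbols is an extra assumption on top of what is described above.

```python
import pandas as pd


def one_to_one_mapping(mapping, source="hgnc_symbol", target="ensembl_gene_id"):
    """Keep only pairs where source and target identify each other uniquely."""
    pairs = mapping[[source, target]].dropna().drop_duplicates()
    # A pair is ambiguous if its source ID or its target ID occurs more than once.
    ambiguous = pairs[source].duplicated(keep=False) | pairs[target].duplicated(keep=False)
    return pairs[~ambiguous].reset_index(drop=True)


def remap_genes(expr, mapping, source="hgnc_symbol", target="ensembl_gene_id"):
    """Re-index a genes-by-cells table; genes without a unique mapping are dropped."""
    lookup = one_to_one_mapping(mapping, source, target).set_index(source)[target]
    kept = expr.loc[expr.index.isin(lookup.index)].copy()
    kept.index = kept.index.map(lookup)
    return kept


# Toy data with made-up identifiers (not real gene IDs).
mapping = pd.DataFrame({
    "hgnc_symbol": ["GENE_A", "GENE_B", "GENE_B", "GENE_C", "GENE_D", None],
    "ensembl_gene_id": ["ENSG_1", "ENSG_2", "ENSG_3", "ENSG_4", "ENSG_4", "ENSG_5"],
})
expr = pd.DataFrame(
    {"cell_1": [5, 0, 2, 7], "cell_2": [1, 3, 0, 4]},
    index=["GENE_A", "GENE_B", "GENE_C", "GENE_X"],
)

remapped = remap_genes(expr, mapping)
n_lost = len(expr) - len(remapped)  # genes sacrificed by the filter
```

In the toy data only `GENE_A` survives: `GENE_B` maps to two Ensembl IDs, `GENE_C` shares its Ensembl ID with `GENE_D`, and `GENE_X` has no mapping at all.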
For a consistently processed and mapped version of the dataset see #4.
The consensus seems to be that Ensembl IDs are the best choice of gene identifier. So, if possible, I will merge the datasets based on Ensembl IDs.
For the following datasets, Ensembl IDs are available:
For the following datasets, only HGNC/Entrez IDs are available:
Ideally, we would reprocess the above datasets. But due to time constraints and limited data availability we might not be able to do that.
Proposal: map all identifiers to Ensembl IDs before merging.
Problem: what to do with one-to-many (HGNC to Ensembl) mappings?
This is not a negligible issue:
2316 Ensembl IDs map to more than one HGNC symbol.
mart_export.txt
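A count like the one above could be obtained roughly as follows. This is only a sketch: the column headers `Gene stable ID` and `HGNC symbol` and the tab-separated layout are assumptions about the export, the helper `count_multi_mapped` is hypothetical, and the toy rows are made up so that the example runs without the attachment.

```python
import io

import pandas as pd


def count_multi_mapped(table, key, value):
    """Number of distinct `key` IDs that map to more than one distinct `value`."""
    pairs = table[[key, value]].dropna().drop_duplicates()
    n_values = pairs.groupby(key)[value].nunique()
    return int((n_values > 1).sum())


# Stand-in for the export file; identifiers are invented for illustration.
toy_export = io.StringIO(
    "Gene stable ID\tHGNC symbol\n"
    "ENSG_1\tGENE_A\n"
    "ENSG_1\tGENE_B\n"
    "ENSG_2\tGENE_C\n"
    "ENSG_3\tGENE_C\n"
    "ENSG_4\t\n"
)
table = pd.read_csv(toy_export, sep="\t")

# Check both directions, since either one breaks a one-to-one merge.
ensembl_to_many_hgnc = count_multi_mapped(table, "Gene stable ID", "HGNC symbol")
hgnc_to_many_ensembl = count_multi_mapped(table, "HGNC symbol", "Gene stable ID")
```

For the real export, `toy_export` would be replaced by the path to the file, provided the headers and separator match these assumptions.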