hagenaue / Brain_GMT

Code and documentation for the brain-related gene set (Brain.GMT) curation project
MIT License

Duplicate Set Names #7

Open eturkes opened 2 months ago

eturkes commented 2 months ago

When using GSEABase::getGmt(), a common function for reading GMT files, I get an error saying that each setName must be distinct. This refers to column V1 in the attached screenshot. After removing the duplicates in V1, I'm able to use getGmt(), so I recommend removing them from the gene sets.
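As a rough illustration of the workaround described above, here is a minimal, language-agnostic sketch in Python that drops lines whose set name (the first tab-separated column) has already been seen. The function name `dedupe_gmt` and the example set names, URLs, and genes are hypothetical placeholders, not taken from Brain.GMT, and this is not necessarily how the duplicates were removed in practice.

```python
def dedupe_gmt(lines):
    """Keep only the first line for each set name in tab-separated GMT lines.

    The set name is assumed to be the first column, as in the GMT format.
    """
    seen = set()
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        name = line.split("\t", 1)[0]
        if name in seen:
            continue  # a set with this name was already kept
        seen.add(name)
        kept.append(line)
    return kept


# Hypothetical example lines: name, description/URL, then genes.
example = [
    "SET_A\thttp://example.org/a\tGENE1\tGENE2\n",
    "SET_B\thttp://example.org/b\tGENE3\n",
    "SET_A\thttp://example.org/a\tGENE1\tGENE4\n",
]
deduped = dedupe_gmt(example)
```

Note that keeping the first occurrence discards any genes that appear only in later duplicates; merging the gene lists, or renaming the duplicates, would be alternative choices depending on why the names collide.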

Thanks for the resource, this is a great idea. The examples from your paper, “SPERM MOTILITY” and “HEART MORPHOGENESIS”, are some of the ones I'm always seeing come up in my brain-specific enrichment analyses.

[Attached image: Screenshot_20240923_152502]

hagenaue commented 2 months ago

Ooh - thanks for the bug report! I'm glad that you have found it to be useful! I'm hoping to make an improved version of Brain.GMT soon, so if you have any other gene set sources that you would like added, please feel free to send requests!


-- Megan Hastings Hagenauer, Ph.D.

Assistant Research Scientist in the Michigan Neuroscience Institute
Instructor in the Department of Psychology & Neuroscience Graduate Program
University of Michigan-Ann Arbor

Pronouns: She, her, hers

eturkes commented 2 months ago

That's great. No sources to suggest; I just use GO exclusively myself. Oddly, I do notice a term or two missing from your lists that I feel should be there, like this one: https://www.informatics.jax.org/go/term/GO:0033693. That's at least in the human list; I didn't check mouse. I think that's because you're using MSigDB as the source, whereas I download the GMT files from g:Profiler (a while back I think I somehow came to the conclusion that theirs was the most complete, and they update often).

Another handy thing about their files is that they contain the GO identifier. In your GMTs I see metadata like this for GO sets: https://www.gsea-msigdb.org/gsea/msigdb/cards/GOBP_MITOCHONDRIAL_GENOME_MAINTENANCE

That's useful, of course. But I actually wanted to use your lists to filter down my own lists, and for that the GO identifier would be perfect. What I'm referring to is the "Exact source" field in the MSigDB link, like GO:0000002. Otherwise, our set names differ quite a bit in capitalization and other details, so it's hard to filter by name. I suppose it makes more sense to use that metadata slot for the actual source you're using (MSigDB), but if you know a way to grab the Exact source for each MSigDB set, that'd be useful to add to the examples section of this repo.
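To make the filtering idea concrete, here is a small Python sketch under stated assumptions: that a mapping from MSigDB set name to its Exact source is already available (however it is obtained), and that the GMT being filtered carries the GO identifier in its first column. The mapping `name_to_exact_source`, the function `filter_gmt_by_ids`, and the placeholder term and gene names are all hypothetical; only the pairing of GOBP_MITOCHONDRIAL_GENOME_MAINTENANCE with GO:0000002 comes from the discussion above.

```python
def filter_gmt_by_ids(gmt_lines, wanted_ids):
    """Return the GMT rows (as lists of fields) whose first column is a wanted ID.

    Matching on identifiers sidesteps differences in set-name
    capitalization and formatting between sources.
    """
    wanted = {i.strip().upper() for i in wanted_ids}
    kept = []
    for line in gmt_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0].strip().upper() in wanted:
            kept.append(fields)
    return kept


# Hypothetical mapping from MSigDB set name to its "Exact source" GO ID.
name_to_exact_source = {
    "GOBP_MITOCHONDRIAL_GENOME_MAINTENANCE": "GO:0000002",
}

# Set names as they might appear in a Brain.GMT-style file.
brain_set_names = ["GOBP_MITOCHONDRIAL_GENOME_MAINTENANCE"]
wanted_ids = [
    name_to_exact_source[name]
    for name in brain_set_names
    if name in name_to_exact_source
]

# Hypothetical lines from a GMT keyed by GO ID (placeholder genes and terms).
my_gmt = [
    "GO:0000002\tmitochondrial genome maintenance\tGENE1\tGENE2\n",
    "GO:9999999\tplaceholder term\tGENE3\n",
]
filtered = filter_gmt_by_ids(my_gmt, wanted_ids)
```

Set names that have no entry in the mapping are silently skipped here; logging them instead would make it easier to spot sets that lack a GO identifier.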

Other than that, really looking forward to the ENSEMBL IDs :-)

eturkes commented 2 months ago

Actually, getting the exact sources is relatively straightforward with this package: https://cran.r-project.org/web/packages/msigdbr/vignettes/msigdbr-intro.html

So I should be good to go!