piano - An R/Bioconductor package for gene set analysis
https://varemo.github.io/piano/
loadGSC fails to read double-quoted files containing apostrophe #3

Open Kupac opened 6 years ago

Kupac commented 6 years ago

The scan call in loadGSC treats both double and single quotes as quoting characters, which misaligns the columns when a double-quoted file contains an apostrophe. Passing quote="\"" to scan by default could fix the issue. An option could also be added in case a different quoting character is needed. Thanks
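
A minimal sketch of the suggested quote argument, using base R's scan on a single in-memory line. It does not reproduce loadGSC's internal call: the tab separator and the sample line (including the GO:0001 identifier) are assumptions made for illustration only.

```r
# Hypothetical tab-separated line: a gene set name followed by two genes.
# Every field is double-quoted, and one gene name contains an apostrophe.
line <- "\"GO:0001\"\t\"beta'COP\"\t\"betaCOP\""

# Restrict quoting to the double quote, so the apostrophe is kept as an
# ordinary character inside the field.
fields <- scan(text = line, what = "character", sep = "\t",
               quote = "\"", quiet = TRUE)

print(fields)
```

With quoting limited to the double quote, the line splits into three fields and beta'COP keeps its apostrophe.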

varemo commented 6 years ago

Sorry for the slow reply! Thanks for reporting this, I will look into adding this in a future release. As a side note, loadGSC can take a data.frame as input, so if the parsing does not work as intended, I would recommend reading the file into R separately and then passing the data.frame to loadGSC.
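
The workaround could look roughly like this. It is a sketch under assumptions, not the maintainer's code: the temporary file, its two-column layout (gene first, gene set second, no header) and the names path and gs_table are hypothetical. The loadGSC call is only attempted if piano is installed.

```r
# Hypothetical input: a tab-separated, double-quoted file with one
# gene-to-gene-set pair per line (column 1 = gene, column 2 = gene set).
path <- tempfile(fileext = ".tsv")
writeLines(c("\"betaCOP\"\t\"GO:0001\"",
             "\"beta'COP\"\t\"GO:0001\""), path)

# Parse the file outside piano, treating only the double quote as a
# quoting character so that apostrophes in gene names are preserved.
gs_table <- read.delim(path, header = FALSE, quote = "\"",
                       col.names = c("gene", "set"),
                       stringsAsFactors = FALSE)

# Hand the parsed data.frame to loadGSC instead of the file path.
if (requireNamespace("piano", quietly = TRUE)) {
  geneset <- piano::loadGSC(gs_table)
}
```

Checking gs_table$gene before calling loadGSC is a quick way to confirm that betaCOP and beta'COP are still distinct entries.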

varemo commented 6 years ago

User-comment by email:

I've been using piano's runGSA for years. Only after the newest updates (of either piano or R) have I run into a problem. It seems that if a pathway name or a gene name contains an apostrophe ('), something in the code breaks and the result contains a copy of the pathway record (pathway name and genes) without the apostrophe. GO, for example, has names like this. This isn't an issue for pathway names, since they can be modified, but it is for gene names. For example, Drosophila melanogaster has genes called betaCOP and beta'COP. They are different genes and shouldn't be mixed up. Of course, gene names can be converted to Entrez IDs, but that work is tedious.

My code for running is:

gsares <- runGSA(geneLevelStats=stats, gsc=geneset, signifMethod="geneSampling",
        geneSetStat="gsea", nPerm=1000, verbose=TRUE, ncpus=20)

Here geneset is an object of class GSC, and stats is a data frame of gene-level t-statistics with gene names as row names. I do not get any errors.