r-bioinformatics / edgePy

A Python implementation of edgeR for differential expression analysis
https://edgepy.readthedocs.io
MIT License
96 stars 10 forks source link

read_handle() needs to strip off non-genes from data import #54

Closed apfejes closed 6 years ago

apfejes commented 6 years ago

the current read_handle() function for reading from a .txt.gz file does not handle stripping out lines that are non-genes, such as "no_function".

See included GSE49712 data set in edgepy/data/ - the last 5 lines should all be dropped.

Will have to recreate the .npz file as well. (you can do that by including the line:

dge_list_first.write_npz_file(filename='/tmp/GSE49712_HTSeq.txt')

in test_cycle_dge_npz()

and then moving the resulting file (after running unit tests) in /tmp/ into the edgePy/data directory

Carbyne commented 6 years ago

From reading edgeR and DESeq2 src code, it seems that we simply need to filter out lines starting with an underscore (called MetaTags in edgeR and specialRows in DESeq2). Also specifically filter out the lines: ["no_feature", "ambiguous", "too_low_aQual", "not_aligned", "alignment_not_unique"], as these were made with older versions of HTseq.

apfejes commented 6 years ago

sounds reasonable. that's pretty much the same rows I saw that need to be filtered out. I did not see any genes starting with an underscore, however.

Carbyne commented 6 years ago

HTseq documentation calls them "special counters", and they are now prefixed with two underscores. I haven't been able to find any other rna-seq counting tools using underscore prefixes though.

sridhar0605 commented 6 years ago

or just use this if "__" not in row[0]:

apfejes commented 6 years ago

I think you're looking for if not row[0].startswith("__"):

sridhar0605 commented 6 years ago

Yes, that is right

__alignment_not_unique
__ambiguous
__no_feature
__not_aligned
__too_low_aQual
Carbyne commented 6 years ago

Should we inform the user of these rows being filtered out, or just do it silently?

apfejes commented 6 years ago

I think it's fine to do it silently - there are no valid gene names in any system that start with "__" . That said, it should be in our documentation.

apfejes commented 6 years ago

This was done by @Carbyne - included in #59