Closed apfejes closed 6 years ago
From reading edgeR and DESeq2 src code, it seems that we simply need to filter out lines starting with an underscore (called MetaTags in edgeR and specialRows in DESeq2). Also specifically filter out the lines: ["no_feature", "ambiguous", "too_low_aQual", "not_aligned", "alignment_not_unique"], as these were made with older versions of HTseq.
sounds reasonable. that's pretty much the same rows I saw that need to be filtered out. I did not see any genes starting with an underscore, however.
HTseq documentation calls them "special counters", and they are now prefixed with two underscores. I haven't been able to find any other rna-seq counting tools using underscore prefixes though.
or just use this if "__" not in row[0]:
I think you're looking for if not row[0].startswith("__"):
Yes, that is right
__alignment_not_unique
__ambiguous
__no_feature
__not_aligned
__too_low_aQual
Should we inform the user of these rows being filtered out, or just do it silently?
I think it's fine to do it silently - there are no valid gene names in any system that start with "__" . That said, it should be in our documentation.
This was done by @Carbyne - included in #59
the current read_handle() function for reading from a .txt.gz file does not handle stripping out lines that are non-genes, such as "no_function".
See included GSE49712 data set in edgepy/data/ - the last 5 lines should all be dropped.
Will have to recreate the .npz file as well. (you can do that by including the line:
dge_list_first.write_npz_file(filename='/tmp/GSE49712_HTSeq.txt')
in test_cycle_dge_npz()
and then moving the resulting file (after running unit tests) in /tmp/ into the edgePy/data directory