parsing from sRNAbench data (II)

JFsanchezherrero commented 4 years ago

Hi there,

We (@lsumoy and I) have found an unexpected result from miRTop when parsing sRNAbench data. It is related to the previous issue #53, but not entirely, which is why we opened a new one.

We came across this while working on a way to turn miRTop results into a count matrix, which could be useful for DE analysis, containing all the information regarding canonical, mature, variant and license plate data. As stated before (https://github.com/miRTop/mirtop/issues/53#issuecomment-467639223), we might be interested in contributing to the code.

We think the issue shown here should be fixed by people familiar with the miRTop code, so we are reporting it.

Expected behavior and actual behavior.

We have found a discrepancy in the summed expression counts per microRNA isomiR when parsing from sRNAbench to miRTop GFF.

I have run some tests and they all point to the parser skipping the variant type "mv" (multiple variants), among others, from the sRNAbench microRNAannotation.txt and reads.annotation files. This creates an imbalance in the total counts, e.g. for hsa-miR-10b-5p:

sRNAbench: 61136
miRTop: 58429

Steps to reproduce the problem.

I have included a couple of examples in these files, covering the example below and one other: microRNAannotation.txt reads.annotation.txt

jsanchez@cacau:test$ grep 'hsa-miR-10b-5p' microRNAannotation.txt | awk '{sum += $6} END {print sum}'
61136
jsanchez@cacau:test$ grep 'hsa-miR-10b-5p' microRNAannotation.txt | grep -v 'mv' | awk '{sum += $6} END {print sum}'
58601
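For anyone who prefers to script this, the two commands above can be sketched in Python. This is only a rough equivalent: it assumes, as the awk calls do, that the read count is the sixth whitespace-separated field. The function name `sum_counts` and its parameters are illustrative and are not part of sRNAbench or miRTop.

```python
def sum_counts(lines, pattern, exclude=None, count_field=6):
    """Sum the count field over all lines that contain `pattern`.

    Rough equivalent of:
        grep PATTERN file | grep -v EXCLUDE | awk '{sum += $6} END {print sum}'
    `count_field` is 1-based, like awk's $6 (an assumption taken from the
    commands in this issue).
    """
    total = 0
    for line in lines:
        if pattern not in line:
            continue  # grep PATTERN
        if exclude is not None and exclude in line:
            continue  # grep -v EXCLUDE (substring match, like grep)
        fields = line.split()
        if len(fields) < count_field:
            continue  # line too short to hold a count
        try:
            total += int(fields[count_field - 1])
        except ValueError:
            continue  # header or non-integer value: skip it
    return total
```

Passing the open microRNAannotation.txt file as `lines`, with `pattern="hsa-miR-10b-5p"`, should correspond to the first command; adding `exclude="mv"` should correspond to the second.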

I repeated the same command for the different variant types identified for this microRNA and sample:

lv3p: 53452
nta: 3233
mv: 2535
exact: 1391
lv5p: 439
exactNucVar: 84
mlv3p: 2

I cannot clearly see what is going on or what is missing here. I cannot reproduce the miRTop total, so I guess that, besides the mv variants, some other variants are not being included either.
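The per-type counts above can be checked against the two totals with a few lines of Python. This is only an arithmetic sketch using the numbers quoted in this issue; the variable names are made up for illustration.

```python
# Per-variant read counts for hsa-miR-10b-5p, as listed in this issue.
variant_counts = {
    "lv3p": 53452,
    "nta": 3233,
    "mv": 2535,
    "exact": 1391,
    "lv5p": 439,
    "exactNucVar": 84,
    "mlv3p": 2,
}

srnabench_total = 61136  # sum of column 6 in microRNAannotation.txt
mirtop_total = 58429     # total obtained from the miRTop GFF

# Sum over all variant types; this should reproduce the sRNAbench total.
total_from_types = sum(variant_counts.values())

# Total once the "mv" reads are dropped.
without_mv = total_from_types - variant_counts["mv"]

# Reads still missing from miRTop after accounting for "mv".
unexplained = without_mv - mirtop_total

print(total_from_types)  # 61136
print(without_mv)        # 58601
print(unexplained)       # 172
```

So the variant types add up to the sRNAbench total, and dropping mv explains 2535 of the 2707 missing reads. The remaining 172 reads are not accounted for by mv alone.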

I also include here the gff file generated by miRTop. miRTop.gff.txt

Specifications like the version of the project, operating system, or hardware.

We are running this on:

debian8.10 linux
python2.7
mirtop (0.3.17)
miRBase v22
genome-build-id: GRCh38
genome-build-accession: NCBI_Assembly:GCA_000001405.15

lpantano commented 4 years ago

Thanks a lot for this. When I was implementing this, I realized some of the variants cannot be parsed to GFF. I can take a look into that since it is very important.

Can you paste the information mirtop prints while running the conversion from sRNAbench to GFF?

JFsanchezherrero commented 4 years ago

Hi there Lorena, they definitely seem important, at least for some samples.

We think these variants must be hard to fit into any of the existing categories, but it could be appropriate to include them in the GFF anyway, even under a common name such as "non classified". That way, you would be able to sum all counts at the level of variant, canonical or isomiR and always obtain the same total. You could also change the category of these variants later, or classify them better among the others.

Here is the information generated during the creation of gff by mirtop. This is stored in run.log

INFO-mirtop.libs.logger(27): Run annotation
INFO-mirtop.libs.logger(47): Reads with isomiR information 19317
INFO-mirtop.libs.logger(131): Loaded 568 reads with 24382 hits
INFO-mirtop.libs.logger(132): Reads without precursor information: 1601
INFO-mirtop.libs.logger(134): Reads with MV as variant definition, not supported by GFF: 1829
INFO-mirtop.libs.logger(135): Hit Filtered by having > 3 changes: 0
INFO-mirtop.libs.logger(49): It took 0.060 minutes

I can read in the log that there are MV variants (1829) that could not be included, but this number is not the same as the 2535 I counted from the sRNAbench microRNAannotation.txt file either. I guess there is something else missing here.

Thank you very much in advance

JFsanchezherrero commented 4 years ago

Hi there,

I have just realized that I made a mistake in the previous comment.

The number reported in the run.log output when generating the GFF, 1829 reads with MV variants, must refer to single entries; each entry can have multiple reads mapping to a given miRNA with a given variant type. This 1829 is the result of parsing the sRNAbench output for all the miRNAs identified in this sample and condition. It has nothing to do with the 2535 reads that account for the count imbalance in the example miRNA.

I have checked the total number of entries (= lines) in the microRNAannotation file for this sample that contain mv as a variant annotation, either alone or combined with others (e.g. mv, mv$lv3p, ...), and it comes to 1821. (This is not the same as the reported number either, but at least it is very close.)

Previously I only attached to this issue, as an example, annotations for a couple of miRNAs with a clear imbalance in total counts between sRNAbench and miRTop. If you think it is necessary, I can send you the whole files generated by sRNAbench, or at least the whole microRNAannotation and reads.annotation files.
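To keep entries and reads apart, a count like the one above can be sketched in Python as follows. This is a hypothetical helper, not miRTop or sRNAbench code. It assumes that combined annotations are joined with "$" (as in mv$lv3p) and that the read count is the sixth whitespace-separated field, as in the awk commands earlier in this issue.

```python
def mv_entries_and_reads(lines, count_field=6):
    """Return (entries, reads) for lines annotated with the 'mv' variant.

    An entry is one line; reads is the sum of the count field over those
    lines. A line qualifies when any whitespace-separated field, split on
    '$', contains the exact token 'mv' (assumed annotation layout).
    """
    entries = 0
    reads = 0
    for line in lines:
        fields = line.split()
        # Exact token match, so 'mlv3p' or 'lv3p' never count as 'mv'.
        if not any("mv" in field.split("$") for field in fields):
            continue
        entries += 1
        if len(fields) >= count_field:
            try:
                reads += int(fields[count_field - 1])
            except ValueError:
                pass  # non-integer count field: entry counted, reads not
    return entries, reads
```

Note that the exact-token match is stricter than a plain `grep mv`, which would also pick up any line where "mv" appears inside a longer word.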

Thanks
