miRTop / mirtop

command lines tool to annotate miRNAs with a standard mirna/isomir naming
https://mirtop.readthedocs.org
MIT License
17 stars 21 forks source link

parsing from sRNAbench data #53

Open lsumoy opened 5 years ago

lsumoy commented 5 years ago

Hi,

We recently detected that gff output form mirtop stats gives unexpected results. We got two entries for canonical (variant=NA for many miRNAs instead of a single one (one of the two in fact is a n add-3p=-1 isoform but the miRTop generated gff tags it as NA (canonical).

In many instances wo results come out annotated as 'NA? (i.e. canonical or mature reference in miRBase) in our parsed output:

example: in the case of miR-151a-5p http://www.mirbase.org/cgi-bin/mature.pl?mature_acc=MIMAT0004697 the canonical reference lecenseplate ID is PhkUD0w

9589 isomiR HC_BST_RA_L001_10 hsa-mir-151a hsa-miR-151a-5p PhkUD0 TCGAGGAGCTCACAGTCT 9593 isomiR HC_BST_RA_L001_10 hsa-mir-151a hsa-miR-151a-5p PhkUD0w TCGAGGAGCTCACAGTCTAGT expression 9589 169 9593 1551

Looking at gff files generated by mirtop (please see highlighted (**) entries that have NA as variant):

grep -w 'PhkUD0' HC_BST_RA_L001_10.gff hsa-mir-151a miRBasev22 isomiR 10 29 . + . Read=TCGAGGAGCTCACAGTCTA; UID=PhkUD0@2; Name=hsa-miR-151a-5p; Parent=hsa-mir-151a; Variant=iso_3p:+1; Cigar=19M; Expression=14; Filter=Pass; Hits=2 hsa-mir-151a miRBasev22 ref_miRNA 10 28 . + . Read=TCGAGGAGCTCACAGTCT; UID=PhkUD0; Name=hsa-miR-151a-5p; Parent=hsa-mir-151a; Variant=NA; Cigar=18M; Expression=169; Filter=Pass; Hits=2 hsa-mir-151b miRBasev22 isomiR 59 78 . + . Read=TCGAGGAGCTCACAGTCTA; UID=PhkUD0@2; Name=hsa-miR-151b; Parent=hsa-mir-151b; Variant=iso_3p:-2; Cigar=19M; Expression=14; Filter=Pass; Hits=2 hsa-mir-151b miRBasev22 isomiR 59 77 . + . Read=TCGAGGAGCTCACAGTCT; UID=PhkUD0; Name=hsa-miR-151b; Parent=hsa-mir-151b; Variant=iso_3p:-3; Cigar=18M; Expression=169; Filter=Pass; Hits=2 hsa-mir-151b miRBasev22 isomiR 59 80 . + . Read=TCGAGGAGCTCACAGTCTAAA; UID=PhkUD0@; Name=hsa-miR-151b; Parent=hsa-mir-151b; Variant=iso_add:+2; Cigar=19MAM; Expression=1; Filter=Pass; Hits=1

grep -w 'PhkUD0w' HC_BST_RA_L001_10.gff hsa-mir-151a miRBasev22 isomiR 10 32 . + . Read=TCGAGGAGCTCACAGTCTAGTA; UID=PhkUD0w@2; Name=hsa-miR-151a-5p; Parent=hsa-mir-151a; Variant=iso_3p:+1; Cigar=22M; Expression=657; Filter=Pass; Hits=1 hsa-mir-151a miRBasev22 isomiR 10 33 . + . Read=TCGAGGAGCTCACAGTCTAGTAA; UID=PhkUD0w@1; Name=hsa-miR-151a-5p; Parent=hsa-mir-151a; Variant=iso_add:+1; Cigar=22MA; Expression=56; Filter=Pass; Hits=1 hsa-mir-151a miRBasev22 isomiR 10 34 . + . Read=TCGAGGAGCTCACAGTCTAGTAAA; UID=PhkUD0w@; Name=hsa-miR-151a-5p; Parent=hsa-mir-151a; Variant=iso_add:+2; Cigar=22MAA; Expression=2; Filter=Pass; Hits=1 hsa-mir-151a miRBasev22 ref_miRNA 10 31 . + . Read=TCGAGGAGCTCACAGTCTAGT; UID=PhkUD0w; Name=hsa-miR-151a-5p; Parent=hsa-mir-151a; Variant=NA; Cigar=21M; Expression=1551; Filter=Pass; Hits=1

Steps to reproduce the problem.

/imppc/labs/lslab/jsanchez/python_environment/mirtop_jsanchez/bin/mirtop gff --sps hsa --hairpin /imppc/labs/lslab/aluna/opt/sRNAtoolboxDB/libs/hairpin.fa --gtf /imppc/labs/lslab/aluna/opt/hsa.gff3 --format srnabench -o /imppc/labs/lslab/share/data/proc_data/20190208_LungCancer_Serum_small_RNAseq/2.Analysis_20190208/LungCancer_Serum_PE/6.2.isomiR_miRTop/HC_BST_RA_L001_10 /imppc/labs/lslab/share/data/proc_data/20190208_LungCancer_Serum_small_RNAseq/2.Analysis_20190208/LungCancer_Serum_PE/6.1.isomiR_sRNAbenchtoolbox/HC_BST_RA_L001_10 2> /imppc/labs/lslab/share/data/proc_data/20190208_LungCancer_Serum_small_RNAseq/2.Analysis_20190208/LungCancer_Serum_PE/6.2.isomiR_miRTop/HC_BST_RA_L001_10_logfile.txt

/imppc/labs/lslab/jsanchez/python_environment/mirtop_jsanchez/bin/mirtop stats -o /imppc/labs/lslab/share/data/proc_data/20190208_LungCancer_Serum_small_RNAseq/2.Analysis_20190208/LungCancer_Serum_PE/6.2.isomiR_miRTop/HC_BST_RA_L001_10/stats /imppc/labs/lslab/share/data/proc_data/20190208_LungCancer_Serum_small_RNAseq/2.Analysis_20190208/LungCancer_Serum_PE/6.2.isomiR_miRTop/HC_BST_RA_L001_10/HC_BST_RA_L001_10.gff 2>> /imppc/labs/lslab/share/data/proc_data/20190208_LungCancer_Serum_small_RNAseq/2.Analysis_20190208/LungCancer_Serum_PE/6.2.isomiR_miRTop/HC_BST_RA_L001_10_logfile.txt

Specifications like the version of the project, operating system, or hardware.

We are running this on: mirtop (0.3.17) python2.7 debian8.10 linux

lpantano commented 5 years ago

Hi,

thanks for reporting this. This may be fixed in the devel version. Can you provide with these two files:

You can create them to contain only these two sequences. I can test it here, and see what is going on.

As well, you can try to use the devel version yourself.

Thanks a lot for spotting this.

Cheers

lsumoy commented 5 years ago

Here are the specific entries for the example in the two sRNAbench output files:

microRNAannotation.txt lines: TCGAGGAGCTCACAGTCTAGT hsa-miR-151a-5p hsa-mir-151a exact - 1551 597.3388956586441 351.1619134905494 TCGAGGAGCTCACAGTCT hsa-miR-151a-5p$hsa-miR-151b hsa-mir-151a$hsa-mir-151b exact$lv3p|lv3pT|lv3p#-3 - 169 65.08721687060661 38.263290380337104

reads.annotation.txt lines: TCGAGGAGCTCACAGTCTAGT 1551 351.1619134905494 mature#sense mature#hsa-miR-151a-5p#sense#hsa-mir-151a,11,31 1 TCGAGGAGCTCACAGTCT 169 38.263290380337104 mature#sense mature#hsa-miR-151a-5p#sense#hsa-mir-151a,11,28$mature#hsa-miR-151b#sense#hsa-mir-151b,60,77 2

reads.annotation.txt microRNAannotation.txt

lpantano commented 5 years ago

sorry, another question:

another question, is this mirbase 21 or 22?

On Wed, Feb 20, 2019 at 8:18 AM lsumoy notifications@github.com wrote:

Here are the specific entries for the example in the two sRNAbench output files:

microRNAannotation.txt lines:

TCGAGGAGCTCACAGTCTAGT hsa-miR-151a-5p hsa-mir-151a exact - 1551 597.3388956586441 351.1619134905494 TCGAGGAGCTCACAGTCT hsa-miR-151a-5p$hsa-miR-151b hsa-mir-151a$hsa-mir-151b exact$lv3p|lv3pT|lv3p#-3 - 169 65.08721687060661 38.263290380337104

reads.annotation.txt lines:

TCGAGGAGCTCACAGTCTAGT 1551 351.1619134905494 mature#sense mature#hsa-miR-151a-5p#sense#hsa-mir-151a,11,31 1 TCGAGGAGCTCACAGTCT 169 38.263290380337104 mature#sense mature#hsa-miR-151a-5p#sense#hsa-mir-151a,11,28$mature#hsa-miR-151b#sense#hsa-mir-151b,60,77 2

reads.annotation.txt https://github.com/miRTop/mirtop/files/2884720/reads.annotation.txt microRNAannotation.txt https://github.com/miRTop/mirtop/files/2884709/microRNAannotation.txt

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/miRTop/mirtop/issues/53#issuecomment-465569864, or mute the thread https://github.com/notifications/unsubscribe-auth/ABi_HECb-0Aai86aWJa4dAh-pQeJ91tNks5vPUsIgaJpZM4bDglo .

lpantano commented 5 years ago

Hi,

now that I am looking better to the sRNAbench output, I see that there is where the two reads are labeled as exact for hsa-miR-151a-5p. I assume the order is relevant in the columns indicating the miRNA and the variants.

So, for instance, read:

TCGAGGAGCTCACAGTCTAGT says is hsa-mir-151a exact

and read:

TCGAGGAGCTCACAGTCT says is hsa-mir-151a$hsa-mir-151b exact$lv3p|lv3pT|lv3p#-3, and I assume the first variant type (exact) is for hsa-mir-151a and the second (lv3p) for hsa-mir-151b.

mirtop is only parsing that. I assume the variants should be switched, it would be worthy to have the opinion of @mlhack on this.

Cheers

JFsanchezherrero commented 5 years ago

Hi Lorena, I reply here to your first question.

It is version 22 for mirBase in this case.

gff-version 3

date 2018-3-5

Chromosomal coordinates of Homo sapiens microRNAs

microRNAs: miRBase v22

genome-build-id: GRCh38

genome-build-accession: NCBI_Assembly:GCA_000001405.15

Thank you very much

lsumoy commented 5 years ago

This was done with mirBase 22.

Looking it up it seems the shorter form is the exact for miR-151b and the longer one is the exact for 151a. The fact that the shorter could be at the same time a add-3p:-3 of miR-151a leads to the confusion.

Beyond issues with files formats and data parsing to me this means that it is tricky to perform analysis grouping by isomiR class and that it makes all the sense to do it at the individual sequence (license plate) level. If not it represents a nightmare in the few cases were there is redundancy in terms of annotation. I suppose similar issue happen with alternative transcript isoform and splicing analysis in mRNA.

mlhack commented 5 years ago

Hi all, thanks for detecting this. it really seems that the labels are incorrectly switched. i will check and fix asap. Cheers, Michael

lpantano commented 5 years ago

Hi @lsumoy,

Just to follow up on your comment:

Yes, it is difficult to group by anything for these cases, that is the reason the format should give you enough information to do what it may best, like work with sequence levels, or even filter these cases.

It would be a good thing if you have somebody there to contribute in the code to create matrix with unique uid, where each sequence is represented only once and is labeling the sequences with all possible sources. If you are willing to add this feature, let me know, I can help with details!

Thanks!

mlhack commented 5 years ago

Hi all, i fixed this issue. now, when one sequence was assigned to more than one reference sequence with different isomiR labels, those are in the correct order. maybe one should try to guess the most likely scenario instead of making multiple labeling. for example, in miRGeneDB, there is no hsa-mir-151b (correct @BastianFromm ?) and therefore the short sequence might be likely a 3' length variant of hsa-mir-151a. another probability would be to check for the abundance of the reference sequences assigning the read only to the most abundant one. Michael

lpantano commented 5 years ago

Thank you for the fix.

All of these ideas sound good to me, we are trying to test some of them with simulated data. Using other more robust databases will help as well, as miRGeneDB for sure.

Cheers