Open lsumoy opened 5 years ago
Hi,
thanks for reporting this. This may be fixed in the devel version. Can you provide with these two files:
You can create them to contain only these two sequences. I can test it here, and see what is going on.
As well, you can try to use the devel version yourself.
Thanks a lot for spotting this.
Cheers
Here are the specific entries for the example in the two sRNAbench output files:
microRNAannotation.txt lines: TCGAGGAGCTCACAGTCTAGT hsa-miR-151a-5p hsa-mir-151a exact - 1551 597.3388956586441 351.1619134905494 TCGAGGAGCTCACAGTCT hsa-miR-151a-5p$hsa-miR-151b hsa-mir-151a$hsa-mir-151b exact$lv3p|lv3pT|lv3p#-3 - 169 65.08721687060661 38.263290380337104
reads.annotation.txt lines: TCGAGGAGCTCACAGTCTAGT 1551 351.1619134905494 mature#sense mature#hsa-miR-151a-5p#sense#hsa-mir-151a,11,31 1 TCGAGGAGCTCACAGTCT 169 38.263290380337104 mature#sense mature#hsa-miR-151a-5p#sense#hsa-mir-151a,11,28$mature#hsa-miR-151b#sense#hsa-mir-151b,60,77 2
sorry, another question:
another question, is this mirbase 21 or 22?
On Wed, Feb 20, 2019 at 8:18 AM lsumoy notifications@github.com wrote:
Here are the specific entries for the example in the two sRNAbench output files:
microRNAannotation.txt lines:
TCGAGGAGCTCACAGTCTAGT hsa-miR-151a-5p hsa-mir-151a exact - 1551 597.3388956586441 351.1619134905494 TCGAGGAGCTCACAGTCT hsa-miR-151a-5p$hsa-miR-151b hsa-mir-151a$hsa-mir-151b exact$lv3p|lv3pT|lv3p#-3 - 169 65.08721687060661 38.263290380337104
reads.annotation.txt lines:
TCGAGGAGCTCACAGTCTAGT 1551 351.1619134905494 mature#sense mature#hsa-miR-151a-5p#sense#hsa-mir-151a,11,31 1 TCGAGGAGCTCACAGTCT 169 38.263290380337104 mature#sense mature#hsa-miR-151a-5p#sense#hsa-mir-151a,11,28$mature#hsa-miR-151b#sense#hsa-mir-151b,60,77 2
reads.annotation.txt https://github.com/miRTop/mirtop/files/2884720/reads.annotation.txt microRNAannotation.txt https://github.com/miRTop/mirtop/files/2884709/microRNAannotation.txt
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/miRTop/mirtop/issues/53#issuecomment-465569864, or mute the thread https://github.com/notifications/unsubscribe-auth/ABi_HECb-0Aai86aWJa4dAh-pQeJ91tNks5vPUsIgaJpZM4bDglo .
Hi,
now that I am looking better to the sRNAbench output, I see that there is where the two reads are labeled as exact for hsa-miR-151a-5p. I assume the order is relevant in the columns indicating the miRNA and the variants.
So, for instance, read:
TCGAGGAGCTCACAGTCTAGT
says is hsa-mir-151a exact
and read:
TCGAGGAGCTCACAGTCT
says is hsa-mir-151a$hsa-mir-151b exact$lv3p|lv3pT|lv3p#-3
, and I assume the first variant type (exact
) is for hsa-mir-151a
and the second (lv3p
) for hsa-mir-151b
.
mirtop
is only parsing that. I assume the variants should be switched, it would be worthy to have the opinion of @mlhack on this.
Cheers
Hi Lorena, I reply here to your first question.
It is version 22 for mirBase in this case.
gff-version 3
date 2018-3-5
Chromosomal coordinates of Homo sapiens microRNAs
microRNAs: miRBase v22
genome-build-id: GRCh38
genome-build-accession: NCBI_Assembly:GCA_000001405.15
Thank you very much
This was done with mirBase 22.
Looking it up it seems the shorter form is the exact for miR-151b and the longer one is the exact for 151a. The fact that the shorter could be at the same time a add-3p:-3 of miR-151a leads to the confusion.
Beyond issues with files formats and data parsing to me this means that it is tricky to perform analysis grouping by isomiR class and that it makes all the sense to do it at the individual sequence (license plate) level. If not it represents a nightmare in the few cases were there is redundancy in terms of annotation. I suppose similar issue happen with alternative transcript isoform and splicing analysis in mRNA.
Hi all, thanks for detecting this. it really seems that the labels are incorrectly switched. i will check and fix asap. Cheers, Michael
Hi @lsumoy,
Just to follow up on your comment:
Yes, it is difficult to group by anything for these cases, that is the reason the format should give you enough information to do what it may best, like work with sequence levels, or even filter these cases.
It would be a good thing if you have somebody there to contribute in the code to create matrix with unique uid
, where each sequence is represented only once and is labeling the sequences with all possible sources. If you are willing to add this feature, let me know, I can help with details!
Thanks!
Hi all, i fixed this issue. now, when one sequence was assigned to more than one reference sequence with different isomiR labels, those are in the correct order. maybe one should try to guess the most likely scenario instead of making multiple labeling. for example, in miRGeneDB, there is no hsa-mir-151b (correct @BastianFromm ?) and therefore the short sequence might be likely a 3' length variant of hsa-mir-151a. another probability would be to check for the abundance of the reference sequences assigning the read only to the most abundant one. Michael
Thank you for the fix.
All of these ideas sound good to me, we are trying to test some of them with simulated data. Using other more robust databases will help as well, as miRGeneDB for sure.
Cheers
Hi,
We recently detected that gff output form mirtop stats gives unexpected results. We got two entries for canonical (variant=NA for many miRNAs instead of a single one (one of the two in fact is a n add-3p=-1 isoform but the miRTop generated gff tags it as NA (canonical).
In many instances wo results come out annotated as 'NA? (i.e. canonical or mature reference in miRBase) in our parsed output:
example: in the case of miR-151a-5p http://www.mirbase.org/cgi-bin/mature.pl?mature_acc=MIMAT0004697 the canonical reference lecenseplate ID is PhkUD0w
9589 isomiR HC_BST_RA_L001_10 hsa-mir-151a hsa-miR-151a-5p PhkUD0 TCGAGGAGCTCACAGTCT
9593 isomiR HC_BST_RA_L001_10 hsa-mir-151a hsa-miR-151a-5p PhkUD0w TCGAGGAGCTCACAGTCTAGT
expression
9589 169
9593 1551
Looking at gff files generated by mirtop (please see highlighted (**) entries that have NA as variant):
grep -w 'PhkUD0' HC_BST_RA_L001_10.gff hsa-mir-151a miRBasev22 isomiR 10 29 . + . Read=TCGAGGAGCTCACAGTCTA; UID=PhkUD0@2; Name=hsa-miR-151a-5p; Parent=hsa-mir-151a; Variant=iso_3p:+1; Cigar=19M; Expression=14; Filter=Pass; Hits=2 hsa-mir-151a miRBasev22 ref_miRNA 10 28 . + . Read=TCGAGGAGCTCACAGTCT; UID=PhkUD0; Name=hsa-miR-151a-5p; Parent=hsa-mir-151a; Variant=NA; Cigar=18M; Expression=169; Filter=Pass; Hits=2 hsa-mir-151b miRBasev22 isomiR 59 78 . + . Read=TCGAGGAGCTCACAGTCTA; UID=PhkUD0@2; Name=hsa-miR-151b; Parent=hsa-mir-151b; Variant=iso_3p:-2; Cigar=19M; Expression=14; Filter=Pass; Hits=2 hsa-mir-151b miRBasev22 isomiR 59 77 . + . Read=TCGAGGAGCTCACAGTCT; UID=PhkUD0; Name=hsa-miR-151b; Parent=hsa-mir-151b; Variant=iso_3p:-3; Cigar=18M; Expression=169; Filter=Pass; Hits=2 hsa-mir-151b miRBasev22 isomiR 59 80 . + . Read=TCGAGGAGCTCACAGTCTAAA; UID=PhkUD0@; Name=hsa-miR-151b; Parent=hsa-mir-151b; Variant=iso_add:+2; Cigar=19MAM; Expression=1; Filter=Pass; Hits=1
grep -w 'PhkUD0w' HC_BST_RA_L001_10.gff hsa-mir-151a miRBasev22 isomiR 10 32 . + . Read=TCGAGGAGCTCACAGTCTAGTA; UID=PhkUD0w@2; Name=hsa-miR-151a-5p; Parent=hsa-mir-151a; Variant=iso_3p:+1; Cigar=22M; Expression=657; Filter=Pass; Hits=1 hsa-mir-151a miRBasev22 isomiR 10 33 . + . Read=TCGAGGAGCTCACAGTCTAGTAA; UID=PhkUD0w@1; Name=hsa-miR-151a-5p; Parent=hsa-mir-151a; Variant=iso_add:+1; Cigar=22MA; Expression=56; Filter=Pass; Hits=1 hsa-mir-151a miRBasev22 isomiR 10 34 . + . Read=TCGAGGAGCTCACAGTCTAGTAAA; UID=PhkUD0w@; Name=hsa-miR-151a-5p; Parent=hsa-mir-151a; Variant=iso_add:+2; Cigar=22MAA; Expression=2; Filter=Pass; Hits=1 hsa-mir-151a miRBasev22 ref_miRNA 10 31 . + . Read=TCGAGGAGCTCACAGTCTAGT; UID=PhkUD0w; Name=hsa-miR-151a-5p; Parent=hsa-mir-151a; Variant=NA; Cigar=21M; Expression=1551; Filter=Pass; Hits=1
Steps to reproduce the problem.
/imppc/labs/lslab/jsanchez/python_environment/mirtop_jsanchez/bin/mirtop gff --sps hsa --hairpin /imppc/labs/lslab/aluna/opt/sRNAtoolboxDB/libs/hairpin.fa --gtf /imppc/labs/lslab/aluna/opt/hsa.gff3 --format srnabench -o /imppc/labs/lslab/share/data/proc_data/20190208_LungCancer_Serum_small_RNAseq/2.Analysis_20190208/LungCancer_Serum_PE/6.2.isomiR_miRTop/HC_BST_RA_L001_10 /imppc/labs/lslab/share/data/proc_data/20190208_LungCancer_Serum_small_RNAseq/2.Analysis_20190208/LungCancer_Serum_PE/6.1.isomiR_sRNAbenchtoolbox/HC_BST_RA_L001_10 2> /imppc/labs/lslab/share/data/proc_data/20190208_LungCancer_Serum_small_RNAseq/2.Analysis_20190208/LungCancer_Serum_PE/6.2.isomiR_miRTop/HC_BST_RA_L001_10_logfile.txt
/imppc/labs/lslab/jsanchez/python_environment/mirtop_jsanchez/bin/mirtop stats -o /imppc/labs/lslab/share/data/proc_data/20190208_LungCancer_Serum_small_RNAseq/2.Analysis_20190208/LungCancer_Serum_PE/6.2.isomiR_miRTop/HC_BST_RA_L001_10/stats /imppc/labs/lslab/share/data/proc_data/20190208_LungCancer_Serum_small_RNAseq/2.Analysis_20190208/LungCancer_Serum_PE/6.2.isomiR_miRTop/HC_BST_RA_L001_10/HC_BST_RA_L001_10.gff 2>> /imppc/labs/lslab/share/data/proc_data/20190208_LungCancer_Serum_small_RNAseq/2.Analysis_20190208/LungCancer_Serum_PE/6.2.isomiR_miRTop/HC_BST_RA_L001_10_logfile.txt
Specifications like the version of the project, operating system, or hardware.
We are running this on: mirtop (0.3.17) python2.7 debian8.10 linux