miRTop / mirtop

command lines tool to annotate miRNAs with a standard mirna/isomir naming
https://mirtop.readthedocs.org
MIT License
18 stars 21 forks source link

ref_miRNA is annotated as isomiR for dme-miR-6-3p #64

Closed DrHogart closed 4 years ago

DrHogart commented 4 years ago

Hi, Thank you for the useful tool, it indeed helps me to proceed with my projects. Currently, I'm trying to use the ShortStack to explore Drosophila miRNAs with the same mature sequence, such as dme-miR-6-3p that is annotated for dme-mir-6-1, dme-mir-6-2 and dme-mir-6-3. For this, I've generated genomic BAM and then converted the coordinates to the pre-miRNA-based coordinates. Then I applied the mirtop to obtain isomir gff with converted BAM, and found that mature dme-mir-6-3p was wrongly attributed as isomiR type (iso_add3p:5) instead of ref_miRNA:

dme-mir-6-3 miRBasev22 isomiR 52 73 0 + . Read TATCACAGTGGCTGTTCTTTTT; UID iso-22-V6J4QWWZQ; Name dme-miR-6-3p;Parent dme-mir-6-3; Variant iso_add3p:5; Cigar 22M;Expression 691; Filter Pass; Hits 1;

The command that I've used is: mirtop gff -d --format BAM --sps dme --hairpin sample_hairpin_mir6.fa --gtf sample_mir6.gff3 --database db -o ./ sample_mir6.bam

Then I've changed the names of dme-miR-6-3p in different precursors such as dme-miR-6-1-3p, dme-miR-6-2-3p and dme-miR-6-3-3p in all .fa, .str and .gff3 files, but the result was the same.

All data are here: files and commands.

lpantano commented 4 years ago

Hi,

thanks for trying this out.

I can think right now of something to check: did you made the calculation of the precursors coordinates from the gff you are using: sample_mir6.gff3 ? if you pick up that miRNA and the different precursors, you can convert the genomic coordinates to precursor coordinates from this file and compare it to what you have in the BAM file, are they different?

DrHogart commented 4 years ago

I've checked the coordinates in all files for the dme-mir-6-3p and for the dme-mir-281-3p, which is also attributed to two pre-mirs but correctly recognized as ref_miRNA (btw, all others multiply attributed miRs -13b-3p, -2a-3p, -2b-3p, 983-3p and 983-5p - are correctly recognized as ref_miRNA). The results are below:

dme-miR-281-3p (dme-mir-281-1) dme-miR-281-3p (dme-mir-281-2) dme-miR-6-3p (dme-mir-6-1) dme-miR-6-3p (dme-mir-6-2) dme-miR-6-3p (dme-mir-6-3)
miRBase gff3 58-80 60-82 52-73 48-69 53-74
miRbase str 58-80 60-82 52-73 48-69 53-74
BAM 58-80 60-82 52-73 48-69 53-74
mirtop gff 57-79 59-81 51-72 47-68 52-73
mirtop result ref_miRNA ref_miRNA iso_add3p:5 iso_add3p:5 iso_add3p:5

So, as you can see, all the positions in mirbase gff3 and BAM are the same. BTW, why the start and end positions in the resulted mirtop gff are shifted -1 relative to the gff3 and BAM?

DrHogart commented 4 years ago

One another observation. When I applied the mirtop with --genomic mode with genomic BAM file with miR-6 and miR-281 reads I've obtained different result:

dme-mir-6-3 miRBasev22 isomiR 53 74 0 + . Read TATCACAGTGGCTGTTCTTTTT; UID iso-22-V6J4QWWZQ; Name dme-miR-6-3p;Parent dme-mir-6-3; Variant iso_3p:-1,iso_add3p:5,iso_snv,iso_5p:+1; Cigar I21M;Expression 691; Filter Pass; Hits 1;

dme-mir-281-2 miRBasev22 isomiR 60 82 0 + . Read TGTCATGGAATTGCTCTCTTTGT; UID iso-23-969BZI4Z0N; Name dme-miR-281-3p;Parent dme-mir-281-2; Variant iso_5p:+1,iso_snv; Cigar I22M;Expression 48; Filter Pass; Hits 1;

So, the coordinates are the same as in the mirbase gff but shifted +1 relative to mirtop result with pre-miRNA-based BAM and non-genomic mode or mirtop; also reads attributed as isomiR.

lpantano commented 4 years ago

Thanks for the detective work, can you tell me the version of mirtop you are using?

thanks

DrHogart commented 4 years ago

oops. mirtop 0.4.23

lpantano commented 4 years ago

Hey, As a quick check I saw that you may have same sequences under different names:

seq_21688_x691  0       dme-mir-6-3     53      255     22M     *       0       0       TATCACAGTGGCTGTTCTTTTT  IIIIIIIIIIIIIIIIIIIIII  XA:i:0  MD:Z:22 NM:i:0  XM:i:4  XX:i:3 XY:Z:P   XZ:f:0.694      XC:Z:AAC        RX:Z:TACCCACGT  RG:Z:miRNA_AAC
seq_479_x104    0       dme-mir-6-2     48      255     22M     *       0       0       TATCACAGTGGCTGTTCTTTTT  IIIIIIIIIIIIIIIIIIIIII  XA:i:0  MD:Z:22 NM:i:0  XM:i:4  XX:i:3 XY:Z:P   XZ:f:0.056      XC:Z:AAC        RX:Z:GCCTTTGAC  RG:Z:miRNA_AAC
seq_5915_x363   0       dme-mir-6-1     52      255     22M     *       0       0       TATCACAGTGGCTGTTCTTTTT  IIIIIIIIIIIIIIIIIIIIII  XA:i:0  MD:Z:22 NM:i:0  XM:i:4  XX:i:3 XY:Z:P   XZ:f:0.25       XC:Z:AAC        RX:Z:ACACTTGGT  RG:Z:miRNA_AAC
[

I don't have the fastq file used, can you check whether these three names has the same sequences? that could be an issue, that for sure I should add as an error. Can you check that?

thanks

lpantano commented 4 years ago

I found the bug, I need to make several test before uploading the patch, something for tomorrow I think.

DrHogart commented 4 years ago

Yes, they are identical and they should be there. I have generated genomic BAM file with ShortStack that attribute multiple mapped reads to different loci based on the local distribution of other uniquely-mapped reads. Then I collapsed the identical reads based on their position, and the resulted file converted to the precursor coordinates before mirtop running. Please note that there are several multi-mapped mature miRs, such as miR-281-3p, but all of them are correctly recognized as ref_miRNA.

чт, 20 февр. 2020 г., 18:37 Lorena Pantano notifications@github.com:

Hey, As a quick check I saw that you may have same sequences under different names:

seq_21688_x691 0 dme-mir-6-3 53 255 22M 0 0 TATCACAGTGGCTGTTCTTTTT IIIIIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:22 NM:i:0 XM:i:4 XX:i:3 XY:Z:P XZ:f:0.694 XC:Z:AAC RX:Z:TACCCACGT RG:Z:miRNA_AAC seq_479_x104 0 dme-mir-6-2 48 255 22M 0 0 TATCACAGTGGCTGTTCTTTTT IIIIIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:22 NM:i:0 XM:i:4 XX:i:3 XY:Z:P XZ:f:0.056 XC:Z:AAC RX:Z:GCCTTTGAC RG:Z:miRNA_AAC seq_5915_x363 0 dme-mir-6-1 52 255 22M * 0 0 TATCACAGTGGCTGTTCTTTTT IIIIIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:22 NM:i:0 XM:i:4 XX:i:3 XY:Z:P XZ:f:0.25 XC:Z:AAC RX:Z:ACACTTGGT RG:Z:miRNA_AAC [

I don't have the fastq file used, can you check whether these three names has the same sequences? that could be an issue, that for sure I should add as an error. Can you check that?

thanks

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/miRTop/mirtop/issues/64?email_source=notifications&email_token=AAE67TTDSIA34YKU4OV5IZ3RD2PSNA5CNFSM4KYB2UCKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEMOYAHY#issuecomment-589135903, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAE67TS2GQQMWF3HFZ4FHR3RD2PSNANCNFSM4KYB2UCA .

lpantano commented 4 years ago

The error is affecting only sequences ending in 6 or more A/T due to a bug. hopefully a minority :S

DrHogart commented 4 years ago

Wow, thanks for letting me know, this make sense. Looking forward to test the patched mirtop

чт, 20 февр. 2020 г., 19:37 Lorena Pantano notifications@github.com:

The error is affecting only sequences ending in 6 or more A/T due to a bug. hopefully a minority :S

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/miRTop/mirtop/issues/64?email_source=notifications&email_token=AAE67TXQJTQOE24UVCVANBLRD2WVNA5CNFSM4KYB2UCKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEMPBEHY#issuecomment-589173279, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAE67TRJCK3KCP5S37QENJTRD2WVNANCNFSM4KYB2UCA .

DrHogart commented 4 years ago

Thanks Lorena, now it works.