mhalushka / miRge3.0

Comprehensive analysis of small RNA sequencing data
MIT License

SNP fasta file (human_mirna_SNP_pseudo_miRBase.fa) #56

Closed taeyoungh closed 2 weeks ago

taeyoungh commented 2 years ago

Hi, I am wondering how "human_mirna_SNP_pseudo_miRBase.fa" in the "fasta.Libs" directory was generated. It differs from "human_mature_miRBase.fa" in more than the SNP suffix: many miRNA IDs are different between the two files. For example, hsa-miR-5190-5p and hsa-miR-5190-3p are found in "human_mirna_SNP_pseudo_miRBase.fa", while only the unsuffixed ID (hsa-miR-5190) is found in "human_mature_miRBase.fa". There are many other examples like this. Could you explain the details? Could you share the code used to generate this file?
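A minimal sketch of how the ID differences described above could be listed, assuming both libraries are plain FASTA files and that SNP records carry a ".SNP<letter>" suffix as in the examples later in this thread. The function names and the toy sequences are hypothetical; only the miRNA IDs come from the thread.

```python
def fasta_ids(lines):
    """Collect record IDs (the first token after '>') from FASTA lines."""
    return {line[1:].split()[0] for line in lines if line.startswith(">")}


def strip_snp_suffix(record_id):
    """Drop a trailing '.SNP<letter>' suffix if present (assumed naming)."""
    head, sep, _ = record_id.rpartition(".SNP")
    return head if sep else record_id


def ids_missing_from_mature(mature_lines, snp_lines):
    """Return SNP-library IDs (suffix removed) absent from the mature library."""
    snp_ids = {strip_snp_suffix(i) for i in fasta_ids(snp_lines)}
    return sorted(snp_ids - fasta_ids(mature_lines))


# Toy records: the IDs come from the thread, the sequences are placeholders.
mature = [">hsa-miR-5190 toy", "ACGTACGTACGTACGTACGT"]
snp = [
    ">hsa-miR-5190-5p.SNPC", "ACGTACGTACGTACGTACGT",
    ">hsa-miR-5190-3p.SNPC", "TGCATGCATGCATGCATGCA",
]
print(ids_missing_from_mature(mature, snp))
```

With real files, an open file handle can be passed in place of each list, since the functions only iterate over lines.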

arunhpatil commented 2 years ago

Hi @taeyoungh ,

I think what you are looking for is SNP annotation in the miRBase library, is that correct? Before I answer your question, I would like to mention that we are revising and updating the libraries, including MirGeneDB. Now, concerning the suffixes: the database annotations are not entirely reflected in SNP_pseudo, and some miRNAs show these discrepancies because miRBase updated their versions individually without any evident change log. These are a few of the things we are fixing now, along with software updates. I hope this helps.

Having said that, please let us know what exactly you are looking for. The SNP_pseudo file doesn't reflect SNPs for reporting miRNA changes (in terms of counts and RPM); it is used to report A2I editing. If you are interested in incorporating SNPs into miRBase for alignment and annotation, then you need to edit the index files. Let me know if this is what you are looking for, and I can assist you in setting up your own custom SNPs.

Thank you, Arun.

taeyoungh commented 2 years ago

Hi @arunhpatil, I am asking about this file because it seems that it was used to generate the bowtie index. For example, when I looked into the bowtie index using bowtie-inspect, I found "hsa-miR-5190-3p". This miRNA is also found in "human_mirna_SNP_pseudo_miRBase.fa" but not in "human_mature_miRBase.fa", which instead has a record for "hsa-miR-5190". I guess that you somehow added "-3p" and "-5p" to "hsa-miR-5190" when generating this file. Am I understanding this correctly? I want to understand how you generated the bowtie index file.

Also, I have a question about genomic coordinates in bowtie index. The output of bowtie-inspect provides a genomic coordinate for every miRNA. For example, there is a record for "hsa-miR-1973 chr8 segs:1-21 cds:+:76202058-76202078" in the header of bowtie index. But I cannot find this miRNA in the gff3 file in the annotation folder. In this case, how did you assign the genomic coordinates to this kind of miRNA?

Thanks for your help!

arunhpatil commented 2 years ago

@taeyoungh,

Thank you for pointing out these IDs. I will have to get back to you on these questions. With regard to the coordinates, I derive them from the GFF file, which in this case does not seem to hold. This is very helpful for tracking down logical errors. I very much appreciate you bringing this to our attention. I will get back to you shortly on this one.

Thank you, Arun.

arunhpatil commented 2 years ago

Hi @taeyoungh ,

The additional 5p or 3p miRNAs were added based on our previous detection of miRNA reads in the genomic loci of the annotated pir miRNAs from miRBase. These reads were part of Toward the human cellular microRNAome study and since then, we have retained these passenger miRNAs as part of our miRge library.

A small percent (394, 0.7%) were identified in more than 50 samples. Additionally, 207 were the unassigned “passenger” 5p or 3p microRNAs from a known microRNA locus, and 15 were orthologous to a different species’ microRNA (primarily primate) (Supplemental Tables S8, S9).

Regarding the coordinates, I believe it is an error and I will correct them soon. Once again, thank you for bringing this to our attention.

I hope this is helpful.

Thank you, Arun.

565755044 commented 1 year ago

Hello, I mainly use the -ai module in your software. I want to use this module to identify miRNA editing sites in Sus scrofa. I used the "miRge build" command you provide to create a new library, but I don't know how to create the custom versions of two files, "human_mirna_SNP_pseudo_miRBase.fasta" and "human_miRNAs_in_repetitive_element_MirGeneDB.csv". What are the rules for creating these two files? I look forward to your reply. Thank you.

arunhpatil commented 1 year ago

Hi @565755044,

These files have to be created manually; the rules for this are described in the miRge paper. You can download the repetitive elements from the UCSC Genome Browser (select Tools -> Table Browser and, under group, select Repeats for the appropriate genome assembly). miRNAs overlapping these repeat elements should be recorded in the csv file as the miRNA name followed by the repeat element name.

For example:

Hsa-Mir-28-P1_5p*,gene_id "L2c"; transcript_id "L2c_dup8856";

This miR-28 overlaps with the L2c genomic coordinates.
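A minimal sketch of how such csv rows could be produced from a repeat annotation instead of by hand, assuming the repeats were exported from the Table Browser as tab-separated GTF and that miRNA coordinates are available as 1-based inclusive intervals. Whether miRge itself considers strand is not stated in this thread, so strand is ignored here. All names and coordinates below are toy values, not real annotations.

```python
def parse_gtf_line(line):
    """Return (chrom, start, end, attributes) from one tab-separated GTF line."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], int(fields[3]), int(fields[4]), fields[8].strip()


def repeat_csv_rows(mirnas, gtf_lines):
    """Build 'miRNA name,attributes' rows for every miRNA/repeat overlap.

    mirnas: iterable of (name, chrom, start, end), 1-based inclusive.
    """
    repeats = [
        parse_gtf_line(line)
        for line in gtf_lines
        if line.strip() and not line.startswith("#")
    ]
    rows = []
    for name, chrom, start, end in mirnas:
        for r_chrom, r_start, r_end, attrs in repeats:
            # Two closed intervals overlap when each starts before the other ends.
            if chrom == r_chrom and start <= r_end and r_start <= end:
                rows.append(f"{name},{attrs}")
    return rows


# Toy input: invented coordinates and names, for illustration only.
gtf = [
    'chr1\ttoy_rmsk\texon\t1000\t1300\t0.0\t+\t.\tgene_id "L2c"; transcript_id "L2c_dup1";',
    'chr2\ttoy_rmsk\texon\t5000\t5200\t0.0\t-\t.\tgene_id "L1P5"; transcript_id "L1P5";',
]
mirnas = [("toy-mir-1_5p", "chr1", 1290, 1311), ("toy-mir-2_3p", "chr2", 100, 121)]
for row in repeat_csv_rows(mirnas, gtf):
    print(row)
```

The nested loop is quadratic; for a whole-genome repeat table, an interval index or a dedicated tool would be a better fit, but the row format stays the same.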

Note the rules are as follows. For repeats, you should have the miR name separated by a comma, followed by gene_id and transcript_id separated by a semicolon. For SNPs, the canonical miRNA header should be named with _SNPC, and any alternative mature sequence with a SNP should be denoted by the _A, _B, _D, etc. suffixes in the FASTA file. Also, this FASTA file should be indexed (i.e., run bowtie-build on this new FASTA file).

For example (the first record is the canonical miRNA sequence, hence _C):

>Hsa-Mir-28-P1_3p.SNPC
CACTAGATTGTGAGCTCCTGGA
>Hsa-Mir-28-P1_3p.SNPA
CACTAGATTGTGAGTTCCTGGA

I hope this is helpful. @mhalushka can add anything I have missed.
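A minimal sketch of writing records in the naming pattern shown in the example above, assuming the ".SNPC" header marks the canonical sequence and variants take the remaining letters in order (A, B, D, ...). The function names are hypothetical and this is not the script used to build the shipped library; the resulting file would still need to be indexed with bowtie-build.

```python
import string


def snp_fasta_records(name, canonical, variants):
    """Return (header, sequence) pairs: canonical as .SNPC, variants as .SNPA, .SNPB, .SNPD, ..."""
    # 'C' is reserved for the canonical sequence, so variant letters skip it.
    letters = [c for c in string.ascii_uppercase if c != "C"]
    if len(variants) > len(letters):
        raise ValueError("too many variants for single-letter suffixes")
    records = [(f">{name}.SNPC", canonical)]
    for letter, seq in zip(letters, variants):
        records.append((f">{name}.SNP{letter}", seq))
    return records


def to_fasta(records):
    """Join (header, sequence) pairs into FASTA text, one line each."""
    return "".join(f"{header}\n{seq}\n" for header, seq in records)


# Sequences taken from the example in this thread.
text = to_fasta(
    snp_fasta_records(
        "Hsa-Mir-28-P1_3p",
        "CACTAGATTGTGAGCTCCTGGA",
        ["CACTAGATTGTGAGTTCCTGGA"],
    )
)
print(text, end="")
```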

Thank you, Arun.

565755044 commented 1 year ago

Hello, thank you for your reply. For the SNPs, I understand that SNPC is the canonical miRNA, but I do not know how you obtained the SNPA or SNPB variants, which are highly similar in sequence and are not mature miRNA sequences. I also have a question about the repeat elements: how is the overlap determined? The repeat elements have the following format: chr1 hg38_rmsk exon 67108754 67109046 1892.000000 + . gene_id "L1P5"; transcript_id "L1P5". To decide whether a miRNA overlaps a repeat, do I have to look up the start position of each miRNA individually? That is too much work! I look forward to hearing from you.

arunhpatil commented 1 year ago

Hi @565755044,

Do you know of any repository of known SNPs for Sus scrofa? If not, can you try copying your miRBase-generated FASTA file as the SNP FASTA file (shown below)?

Replace human with the Sus scrofa database you have:

cp human_mature_miRBase.fa human_mirna_SNP_pseudo_miRBase.fa

This will trick the software, but you will find A-to-I editing hits. You can then see which edits are the most abundant and check whether they are true positives. Also, don't worry about the repeats for A-to-I (as far as I remember, they are not connected).

I hope this helps.

Thank you. Arun

565755044 commented 1 year ago

Hello Arun, thank you for your reply. I used your method with my own Sus scrofa database in place of the human data, and it output some editing sites, but still only a few. I would like to start from the source code to find the root of the problem. I am trying to parse the file 'mirge2 TRF a2i.py'. When the sample is aligned to the reference genome, I want to capture the positions of the editing sites in the reference genome inside the "A2IEditing" or "a2i_editing" function, and then compare those positions with the known miRNAs in the reference genome to determine whether there really are only these editing sites. Can you help me judge whether this approach is feasible? Thank you.

lv

arunhpatil commented 1 year ago

Hi @565755044,

There may not be enough editing sites. Can you share your findings? Modifying the source code may take time; instead, if you can state what you are aiming at, I can help figure out an alternative method.

Thank you, Arun.

565755044 commented 1 year ago

Hi Arun, my problem is that very few miRNA editing sites are identified in my pig small RNA sequencing data. I want to start from the source code and do the following: first, get the position of each editing site in the reference genome after alignment; next, align the known miRNA sequences to the reference genome to get their positions; then compare the two sets of positions to verify whether the editing sites reported by the software are real, and annotate where each editing site falls within the miRNA sequence. I am doing this because I only identified a few editing sites.

At present, I write the contents of the 'retainedSeqContentDicTmp' dictionary to a text file. An entry looks normal: ['aaaaactgagacttttg','+','Chr 10',' 110988977',' aaaaactgagacttt',' iiiiiiiiiiiiiiiiii',''. The code is:

output_text = pprint.pformat(retainedSeqContentDicTmp)
with open('output.txt', 'w') as file:
    file.write(output_text)

The output looks like this: acagcaggcacagacaggca [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], from which I can't get the position of the editing site on the reference genome. Please help me judge whether this approach is feasible, or suggest why I am getting so few editing sites. Thank you.

lv.
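A minimal sketch of the position bookkeeping described above, written independently of the miRge source. It assumes the read aligns to the miRNA from its first base without indels, that an A-to-I edit shows up in the read as an A-to-G mismatch, and that the alignment start is the 1-based leftmost genomic coordinate. The function names and toy sequences are hypothetical.

```python
def a_to_g_positions(reference, read):
    """Return 1-based positions (within the miRNA) where the reference has A and the read has G."""
    ref = reference.upper().replace("U", "T")
    qry = read.upper().replace("U", "T")
    return [
        i
        for i, (r, q) in enumerate(zip(ref, qry), start=1)
        if r == "A" and q == "G"
    ]


def genomic_position(align_start, position, strand, length):
    """Map a 1-based position in the sequence to a genomic coordinate.

    align_start: 1-based leftmost genomic coordinate of the alignment (assumed).
    length: length of the aligned sequence.
    """
    if strand == "+":
        return align_start + position - 1
    # On the minus strand the 5' end of the sequence sits at the rightmost base.
    return align_start + length - position


# Toy sequences, for illustration only.
reference = "ACGTACGTAC"
read = "ACGTGCGTAC"
for pos in a_to_g_positions(reference, read):
    print(pos, genomic_position(100, pos, "+", len(reference)))
```

Mismatches found this way are only candidates: known SNPs and sequencing errors produce the same A-to-G signal, which is why the thread suggests checking the most abundant hits for true positives.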
