miRNA annotation with seqcluster from BAM file

lpantano / seqcluster

small RNA analysis from NGS data

http://seqcluster.readthedocs.io

MIT License

35 stars 17 forks source link

miRNA annotation with seqcluster from BAM file #27

Closed svm-zhang closed 6 years ago

svm-zhang commented 6 years ago

Hello @lpantano,

I am a bit new to miRNA annotation, and I was trying to run seqbuster subcommand seqbuster on my BAM that was aligned by STAR. My command line was pretty standard following the doc:

/bcbio_anaconda_bin/seqcluster seqbuster --hairpin [hairpin.hsa.fa] --mirna [mirna.str] -sps hsa -o test [MY_BAM]

The error I got is "global name os is not defined" in the function _sam_to_bam.py within the init.py file .

So I went into that file and found you imported os.path as op. Simply changing this part didn't fix the problem. I further looked into this function, and the way I understand it is that if the input is with a "sam" suffix it will convert it to a "bam". But why in the if statement, it evaluates whether it ends with "bam"? I updated that to "sam", and this part passed. However, there was a new error from pysam.sort(). Please see the attached file. I feel it might be caused by pysam version? error.txt

I also came across another error when I did not specify --hairpin and --mirna options, the program tried to download the two files, but ended with complaining about "gzip !$.gz". This can be avoided by downloading hairpin.fa and miRNA.str myself. I think you might want to know.

So how can I fix the first problem? I also saw the a GTF option, do I need to prepare that file? If so, any suggestion on how should I prepare it? Thanks very much in advance for your help.

happy holidays!

Simo

lpantano commented 6 years ago

Hi,

sorry for the issues. We are migrating this to:

https://github.com/miRTop/mirtop

The devel branch is the one that will work better. I recommend to use this one if you try to use this functionality.

You can download the devel branch and instal locally with the bcbio python setup.py install.

You will need the hairpin.fa file and the GTF downloaded both from mirbase.

That should produce the output we are supporting now.

However depend on what you want to do after, you may want to use miraligner tool (installed with bcbio) to get the miRNA annotation and the use the isomiRs package to analyze the data.

If you tell me more I can give the best tips for you.

Cheers

sent not from my computer

On Dec 28, 2017, at 13:06, Simo V Zhang notifications@github.com wrote:

Hello @lpantano,

I am a bit new to miRNA annotation, and I was trying to run seqbuster subcommand seqbuster on my BAM that was aligned by STAR. My command line was pretty standard following the doc:

/bcbio_anaconda_bin/seqcluster seqbuster --hairpin [hairpin.hsa.fa] --mirna [mirna.str] -sps hsa -o test [MY_BAM]

The error I got is "global name os is not defined" in the function _sam_to_bam.py within the init.py file .

So I went into that file and found you imported os.path as op. Simply changing this part didn't fix the problem. I further looked into this function, and the way I understand it is that if the input is with a "sam" suffix it will convert it to a "bam". But why in the if statement, it evaluates whether it ends with "bam"? I updated that to "sam", and this part passed. However, there was a new error from pysam.sort(). Please see the attached file. I feel it might be caused by pysam version? error.txt

I also came across another error when I did not specify --hairpin and --mirna options, the program tried to download the two files, but ended with complaining about "gzip !$.gz". This can be avoided by downloading hairpin.fa and miRNA.str myself. I think you might want to know.

So how can I fix the first problem? I also saw the a GTF option, do I need to prepare that file? If so, any suggestion on how should I prepare it? Thanks very much in advance for your help.

happy holidays!

Simo

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

svm-zhang commented 6 years ago

Hello @lpantano,

Thanks very much for the quick response.

Just to confirm how I should install mirtop with my current bcbio. I first git clone the dev branch mirtop, and then put it in the bcbio installation dir, and install from there? Could you please give more specific instructions?
miRBase only provides a GFF3 file, should I use this as value for the GTF option?
As for my downstream analyses, I'd like to quantify miRNA expression and do differential expression across my samples.

Your tips would be very helpful. Thanks in advance. Happy holiday!

cheers, Simo

lpantano commented 6 years ago

Hi,

I would then use mialigner directly to quantify and isomiRs R package to load the data and then work in the differential expression analysis.

Since you have bcbio installed, any reason you don’t want to use the small RNA pipeline already comes with it. It generates everything you need to work with the isomiRs package. That is the easiest way to get everything ready.

If not, you can work with this pipeline:

http://seqcluster.readthedocs.io/mirna_annotation.html#processing-of-reads (from there until miraligner section)

mirtop will be the default in the next months, but still we are working to make it total compatible to work with tools for downstream analysis. The way to installed is to clone the dev branch and then use the python binary from bcbio as: bcbio_python setup.py install. It should be ready to use then.

Thanks for looking into it. Happy to help more if you don’t get working the bcbio small RNA pipeline or the command-by-command tutorial.

Cheers

On Dec 29, 2017, at 11:59 PM, Simo V Zhang notifications@github.com wrote:

Hello @lpantano https://github.com/lpantano,

Thanks very much for the quick response.

Just to confirm how I should install mirtop with my current bcbio. I first git clone the dev branch mirtop, and then put it in the bcbio installation dir, and install from there? Could you please give more specific instructions?

miRBase only provides a GFF3 file, should I use this as value for the GTF option?

As for my downstream analyses, I'd like to quantify miRNA expression and do differential expression across my samples.

Your tips would be very helpful. Thanks in advance. Happy holiday!

cheers, Simo

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/lpantano/seqcluster/issues/27#issuecomment-354527767, or mute the thread https://github.com/notifications/unsubscribe-auth/ABi_HN53BHAuhlAan2mxxMkqSyN4lfolks5tFcMpgaJpZM4ROiGl.

svm-zhang commented 6 years ago

Hello @lpantano,

Thanks very much for your follow up on my problems.

For the last several days, I have been using the step-by-step pipeline as you mentioned at http://seqcluster.readthedocs.io/mirna_annotation.html#processing-of-reads. Let me give you a sense of the sample I am using. It is a sample contains only one miRNA (aly-MIR163). I have 8,356,240 reads after trimming adapter and polyA, and I aligned them using miraligner. Below is the result:

Number of reads to be mapped: 8356240 Searching in precursors Tue Jan 02 11:40:05 PST 2018 Num reads annotated: 13488

The command line I used is:

java -jar miraligner.jar -db [path to alyrata DB] -sub 1 -trim 3 -add 3 -s aly -o [out prefix] -i [fastq file]

As you can see, there are lots of reads that did not find miRNA 163 hits, if I understand correctly here. If I simply grep the miRNA 163 sequence GAAGAGGACUUGGAACUCGAUC in the same fastq file and allow for 1 mismatch, there are about 7.9 million reads have it (https://github.com/Wikinaut/agrep).

So my first question is how can I recover those reads unmapped by miraligner? And now I have the [prefix].mirna file. What should I do next to get the count file?

Last, I am not quite familiar with bcbio_nextgen. Could you please give me a head start on how to run it? I found its small RNA pipeline, http://bcbio-nextgen.readthedocs.io/en/latest/contents/pipelines.html#smallrna-seq. How should I prepare the config file? A example config file would be really helpful.

Apologize that I have so many questions. Thanks again for your help.

cheers, Simo

lpantano commented 6 years ago

Hi,

That is weird, normally after adapter removal and collapsing reads, one don’t see that many million reads. Normally is a problem in any of those steps. Can you send me all the commands you used before the miraligner step?

The unmapped reads should be in one of the other files generated by the program.

Once you have the .mirna file you can follow this:

http://bioconductor.org/packages/release/bioc/vignettes/isomiRs/inst/doc/isomiRs-intro.pdf http://bioconductor.org/packages/release/bioc/vignettes/isomiRs/inst/doc/isomiRs-intro.pdf

To run with bcbio, this is an example of the config file:

https://github.com/chapmanb/bcbio-nextgen/blob/master/config/templates/illumina-srnaseq.yaml https://github.com/chapmanb/bcbio-nextgen/blob/master/config/templates/illumina-srnaseq.yaml

The other config file needed is explained here:

https://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html#automated-sample-configuration

You have all the options explained here:

https://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html#smallrna-sequencing

https://bcbio-nextgen.readthedocs.io/en/latest/contents/pipelines.html#smallrna-seq https://bcbio-nextgen.readthedocs.io/en/latest/contents/pipelines.html#smallrna-seq

https://bcbio-nextgen.readthedocs.io/en/latest/contents/outputs.html#small-rna-seq https://bcbio-nextgen.readthedocs.io/en/latest/contents/outputs.html#small-rna-seq

I know is a lot to read, but once you have run it with bcbio, it is super easy to reproduce and add more samples. And the chances to have an error are lower.

Cheers

On Jan 2, 2018, at 2:55 PM, Simo V Zhang notifications@github.com wrote:

Hello @lpantano https://github.com/lpantano,

Thanks very much for your follow up on my problems.

For the last several days, I have been using the step-by-step pipeline as you mentioned at http://seqcluster.readthedocs.io/mirna_annotation.html#processing-of-reads http://seqcluster.readthedocs.io/mirna_annotation.html#processing-of-reads. Let me give you a sense of the sample I am using. It is a sample contains only one miRNA (aly-MIR163). I have 8,356,240 reads after trimming adapter and polyA, and I aligned them using miraligner. Below is the result:

Number of reads to be mapped: 8356240 Searching in precursors Tue Jan 02 11:40:05 PST 2018 Num reads annotated: 13488

The command line I used is:

java -jar miraligner.jar -db [path to alyrata DB] -sub 1 -trim 3 -add 3 -s aly -o [out prefix] -i [fastq file]

As you can see, there are lots of reads that did not find miRNA 163 hits, if I understand correctly here. If I simply grep the miRNA 163 sequence GAAGAGGACUUGGAACUCGAUC in the same fastq file and allow for 1 mismatch, there are about 7.9 million reads have it (https://github.com/Wikinaut/agrep) https://github.com/Wikinaut/agrep.

So my first question is how can I recover those reads unmapped by miraligner? And now I have the [prefix].mirna file. What should I do next to get the count file?

Last, I am not quite familiar with bcbio_nextgen. Could you please give me a head start on how to run it? I found its small RNA pipeline, http://bcbio-nextgen.readthedocs.io/en/latest/contents/pipelines.html#smallrna-seq http://bcbio-nextgen.readthedocs.io/en/latest/contents/pipelines.html#smallrna-seq. How should I prepare the config file? A example config file would be really helpful.

Apologize that I have so many questions. Thanks again for your help.

cheers, Simo

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/lpantano/seqcluster/issues/27#issuecomment-354860428, or mute the thread https://github.com/notifications/unsubscribe-auth/ABi_HIOGH3_3X_BEXhmExn_9xEcE2qBpks5tGonAgaJpZM4ROiGl.

svm-zhang commented 6 years ago

Hello @lpantano,

I used cutadapt to do adapter-trimming:

--overlap 10 --minimum-length 16 --nextseq-trim 20 --trim-n --max-n 10 -a [3' ADAPTER] -a A{8}N{30} -g [5' ADAPTER] --info-file [info-file] -o [output fastq] [input fastq]

The number I gave you was from this non-collapsed fastq.

I did collapse the adapter-trimmed reads:

seqcluster collapse -f [adapter-trimmed fq] -o [collapsed fq]

And run miraligner on the collapsed fastq using miraligner with -freq option, and I got similar results:

Number of reads to be mapped: 34922 Searching in precursors Num reads annotated: 336

If you have some insights what are reasons here, please let me know.

At the same time, I will run the bcbio pipeline. Thanks very much for all the links. Very helpful!

cheers, Simo

svm-zhang commented 6 years ago

Hello @lpantano,

How can I prepare my A. lyrata genome, transcriptome, and miRBase data? I found this tutorial very confusing. I was trying bcbio_setup_genome.py and it complains sam_fa_indices.loc file. What format that file should be? I guess I also need bwa, bowtie2, and star loc files, if I want to build their indices. If this is too much trouble to set up, I think I would stick to the step-by-step version.

Thanks for any help.

Simo

lpantano commented 6 years ago

Hi,

thanks for the details. It is weird the low number of mapped reads.

I can only think about two issues:

1- did you converted all the U into T of the hairpin.fa file? 2 - there is still nucleotides from the adapter. I would do the perfect grep of your sequence in the collapsed reads and see if you see nucleotides at the beginning. I would say if you have more than 2 nt in any side then is the reason is not mapping to miRNA.

I hope this help.(I’ll try to help with bcbio on Monday)

Cheers

sent not from my computer

On Jan 5, 2018, at 19:56, Simo V Zhang notifications@github.com wrote:

Hello @lpantano,

How can I prepare my A. lyrata genome, transcriptome, and miRBase data? I found this tutorial very confusing. I was trying bcbio_setup_genome.py and it complains sam_fa_indices.loc file. What format that file should be? I guess I also need bwa, bowtie2, and star loc files, if I want to build their indices. If this is too much trouble to set up, I think I would stick to the step-by-step version.

Thanks for any help.

Simo

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

lpantano commented 6 years ago

I use this command to prepare the hairpin.fa

zcat hairpin.fa.gz | awk '{if ($0~/>hsa/){name=$0; print name} else if ($0~/^>/){name=0};if (name!=0 && $0!~/^>/){print $0;}}' | sed 's/U/T/g' > hairpin.fa

if nothing works I can take a look if you can share with me the collapsed file.

On Jan 5, 2018, at 19:56, Simo V Zhang notifications@github.com wrote:

Hello @lpantano,

How can I prepare my A. lyrata genome, transcriptome, and miRBase data? I found this tutorial very confusing. I was trying bcbio_setup_genome.py and it complains sam_fa_indices.loc file. What format that file should be? I guess I also need bwa, bowtie2, and star loc files, if I want to build their indices. If this is too much trouble to set up, I think I would stick to the step-by-step version.

Thanks for any help.

Simo

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

svm-zhang commented 6 years ago

Hello @lpantano,

Thanks again.

It looks like those collapsed reads who have exact match with the sequence of miRNA 163 have residual nucleotides at either or both ends. But adapter trimming can never be perfect, and in the real world this is an usual case, no? Does miraligner has some functionality like soft-clipping? Also, is it possible that miraligner could allow more than 1 mismatches, although it could cause false alignments?

Simo

lpantano commented 6 years ago

Hi,

It is not exactly normal. Do those residual nts map to the precursor of the miRNA that is in the hairpin.fa. If not, is not normal, and you need to figure out the exact adapters to remove. What protocol did u use? Is possible that are always the same number of nts as residuals?

If you can send me a couple of example I can look at that.

Cheers

sent not from my computer

On Jan 7, 2018, at 00:54, Simo V Zhang notifications@github.com wrote:

Hello @lpantano,

Thanks again.

It looks like those collapsed reads who have exact match with the sequence of miRNA 163 have residual nucleotides at either or both ends. But adapter trimming can never be perfect, and in the real world this is an usual case, no? Does miraligner has some functionality like soft-clipping? Also, is it possible that miraligner could allow more than 1 mismatches, although it could cause false alignments?

Simo

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

lpantano commented 6 years ago

Hi again,

Now I came back from holidays so I can help more. I am super curious about your case.

In one of your emails you say you get 34922 from the collapsed file. How many of those make through the grep command using your miRNA sequence?

Did you check whether the residual sequences of some of those match the precursor sequence?

In general, almost all tools won’t do soft-clipping, many are based on bowtie. So Something else is going on there, for sure.

Is this sample a real biological sample, and you are only interesting on that miRNA? It was done with small RNA library protocol?

It is very weird is happening this. I would say I have never had any problem with samples to find miRNAs if they come from small RNA library protocols.

Sorry to not being able to help more.

Cheers

On Jan 7, 2018, at 12:54 AM, Simo V Zhang notifications@github.com wrote:

Hello @lpantano https://github.com/lpantano,

Thanks again.

It looks like those collapsed reads who have exact match with the sequence of miRNA 163 have residual nucleotides at either or both ends. But adapter trimming can never be perfect, and in the real world this is an usual case, no? Does miraligner has some functionality like soft-clipping? Also, is it possible that miraligner could allow more than 1 mismatches, although it could cause false alignments?

Simo

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/lpantano/seqcluster/issues/27#issuecomment-355801774, or mute the thread https://github.com/notifications/unsubscribe-auth/ABi_HF1uVpjtoaXAZaGvdIFSSeGNoJHrks5tIFwPgaJpZM4ROiGl.