mourisl / Rascaf

Scaffolding with RNA-seq read alignment
20 stars 6 forks source link

validate and filter the connections with databases #2

Open melop opened 8 years ago

melop commented 8 years ago

Hello,

   In the manual it mentioned an optional step of validating proposed connections by mapping to a public protein database. I think this is really a great idea, but I don't seem to find the script for doing that in the repository.
  Is this function currently implemented or do the users have to come up with their own solutions?

Best Regards, Rongfeng Cui Max-Planck Institute for the Biology of Ageing.

mourisl commented 8 years ago

Hi Rongfeng,

The feature is kind of there. To do this, you need to run "rascaf" with "-cs" option to output the sequence supporting the connection of two contigs.

As a result, you will have a file look like "*_cs.fa". Then you can blast those sequences against refseq_rna database. If you don't have the blast database locally, you can use the "-remote" option. For example: "blastn -db refseq_rna -task dc-megablast -query rascaf_cs.fa -out blast.out -outfmt 6 -remote -max_target_seqs 20" This procedure is quite slow and assumes you have blastn on your machine.

After that you can run the "EvaluateBlastResults.pl" script as: "perl ./EvaluateBlastResults.pl blast.out rascaf_cs.fa". The output will show the validation for the sequence from rascaf_cs.fa file. 1: valided. 0: uncertain. -1: misorientation -2: misorder -3: chimeric -4: no match in the blast database.

You can then adjust the corresponding connection in the rascaf.out file based on the validation result.

Thanks, Li

melop commented 8 years ago

Dear Li,

Thank you for the quick reply!

For the last step, is there a way to automate the process? I could write a script for it in case you don't already have one.

Another question is why not use tblastn to make the alignment at the amino acid level?

Thanks Ray On Nov 28, 2016 8:58 PM, "Li Song" notifications@github.com wrote:

Hi Rongfeng,

The feature is kind of there. To do this, you need to run "rascaf" with "-cs" option to output the sequence supporting the connection of two contigs.

As a result, you will have a file look like "*_cs.fa". Then you can blast those sequences against refseq_rna database. If you don't have the blast database locally, you can use the "-remote" option. For example: "blastn -db nt -task dc-megablast -query rascaf_cs.fa -out blast.out -outfmt 6 -remote -max_target_seqs 20" This procedure is quite slow and assumes you have blastn on your machine.

After that you can run the "EvaluateBlastResults.pl" script as: "perl ./EvaluateBlastResults.pl blast.out rascaf_cs.fa". The output will show the validation for the sequence from rascaf_cs.fa file. 1: valided. 0: uncertain. -1: misorientation -2: misorder -3: chimeric -4: no match in the blast database.

You can then adjust the corresponding connection in the rascaf.out file based on the validation result.

Thanks, Li

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/mourisl/Rascaf/issues/2#issuecomment-263377020, or mute the thread https://github.com/notifications/unsubscribe-auth/ABlYwTB6WLOeovbvvSrLm55e6blEQ6Ayks5rCzJ3gaJpZM4K9yGt .

melop commented 8 years ago

Sorry I meant tblastx. On Nov 28, 2016 9:33 PM, "Ray Cui" platycerus@gmail.com wrote:

Dear Li,

Thank you for the quick reply!

For the last step, is there a way to automate the process? I could write a script for it in case you don't already have one.

Another question is why not use tblastn to make the alignment at the amino acid level?

Thanks Ray On Nov 28, 2016 8:58 PM, "Li Song" notifications@github.com wrote:

Hi Rongfeng,

The feature is kind of there. To do this, you need to run "rascaf" with "-cs" option to output the sequence supporting the connection of two contigs.

As a result, you will have a file look like "*_cs.fa". Then you can blast those sequences against refseq_rna database. If you don't have the blast database locally, you can use the "-remote" option. For example: "blastn -db nt -task dc-megablast -query rascaf_cs.fa -out blast.out -outfmt 6 -remote -max_target_seqs 20" This procedure is quite slow and assumes you have blastn on your machine.

After that you can run the "EvaluateBlastResults.pl" script as: "perl ./EvaluateBlastResults.pl blast.out rascaf_cs.fa". The output will show the validation for the sequence from rascaf_cs.fa file. 1: valided. 0: uncertain. -1: misorientation -2: misorder -3: chimeric -4: no match in the blast database.

You can then adjust the corresponding connection in the rascaf.out file based on the validation result.

Thanks, Li

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/mourisl/Rascaf/issues/2#issuecomment-263377020, or mute the thread https://github.com/notifications/unsubscribe-auth/ABlYwTB6WLOeovbvvSrLm55e6blEQ6Ayks5rCzJ3gaJpZM4K9yGt .

mourisl commented 8 years ago

For the last step, I'm sorry we don't have the script to remove the connections from rascaf.out now. If you can wait for 1 day, I can implement it now (this is on my todo list anyway).

I think tblastx would work as long as the output is the same as blastn (-outfmt 6).

Thanks, Li

melop commented 8 years ago

Hi Li,

    no problem, please take your time. Let me know when this script is

ready.

    If I want to use a protein database, like taxonomic specific

uniprot, is it also possible (with blastx -outfmt 6)? Best Regards, Ray

On Mon, Nov 28, 2016 at 9:45 PM, Li Song notifications@github.com wrote:

For the last step, I'm sorry we don't have the script to remove the connections from rascaf.out now. If you can wait for 1 day, I can implement it now (this is on my todo list anyway).

I think tblastx would work as long as the output is the same as blastn (-outfmt 6).

Thanks, Li

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/mourisl/Rascaf/issues/2#issuecomment-263388955, or mute the thread https://github.com/notifications/unsubscribe-auth/ABlYwQmm2w5e4FakGKt0RJ55a1B9g0JYks5rCz2CgaJpZM4K9yGt .

mourisl commented 8 years ago

I just tested blastx with one sequence and I believe the script will work fine with blastx.

melop commented 8 years ago

Ok, thank you for testing!

Ray On Nov 28, 2016 10:26 PM, "Li Song" notifications@github.com wrote:

I just tested blastx with one sequence and I believe the script will work fine with blastx.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/mourisl/Rascaf/issues/2#issuecomment-263399295, or mute the thread https://github.com/notifications/unsubscribe-auth/ABlYwUSjKUN_UczTq1RcLu22woQf-WTDks5rC0cBgaJpZM4K9yGt .

mourisl commented 8 years ago

I just uploaded the script "FilterRascafConnection.pl". You can use it after obtaining the validation results from blast. The commands is like: "perl FilterRascafConnection.pl blast_eval.out rascaf_cs.fa rascaf.out > rascaf_filter.out" .

Please let me know whether this works.

Thanks, Li

melop commented 8 years ago

Dear Li,

   thank you for your fast response. I will test this script.

   In your paper you have compared Rascaf to two other RNA scaffolders.

I am aware of an unpublished scaffolder called BESST_RNA. Out of curiosity I'm comparing results between them. So far it seems that BESST_RNA yielded a somewhat higher N50 (input N50: 815,196, BESST_RNA N50: 910,428, Rascaf N50: 848,421), and a slightly better BUSCO performance: (input complete BUSCOs: 4380, BESST_RNA: 4410, Rascaf: 4397). BESST_RNA requires merging of all bam files into a single input. Other than that I set a minimum link number to 4 for both programs. I'm wondering if you would have time to perform some systematic comparisons between the two regarding accuracy and performance.

Best Regards, Ray

On Tue, Nov 29, 2016 at 4:25 AM, Li Song notifications@github.com wrote:

I just uploaded the script "FilterRascafConnection.pl". You can use it after obtaining the validation results from blast. The commands is like: "perl FilterRascafConnection.pl blast_eval.out rascaf_cs.fa rascaf.out > rascaf_filter.out" .

Please let me know whether this works.

Thanks, Li

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/mourisl/Rascaf/issues/2#issuecomment-263466183, or mute the thread https://github.com/notifications/unsubscribe-auth/ABlYwRQbMltGttxeXU1IsBA96NW-iom9ks5rC5s7gaJpZM4K9yGt .

mourisl commented 8 years ago

Thanks for letting me know the comparison. I tried to run BESST_RNA, but it crashed. It is likely a problem about python on our sever. I will try it on a different computer later.

melop commented 8 years ago

Dear Li,

    thanks for looking at this.

Best Regards, Ray

On Tue, Nov 29, 2016 at 11:15 PM, Li Song notifications@github.com wrote:

Thanks for letting me know the comparison. I tried to run BESST_RNA, but it crashed. It is likely a problem about python on our sever. I will try it on a different computer later.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/mourisl/Rascaf/issues/2#issuecomment-263717325, or mute the thread https://github.com/notifications/unsubscribe-auth/ABlYwfEQgBgqBJKRiSYHENJhFFeXpGAWks5rDKP3gaJpZM4K9yGt .