marbl / harvest

Question about parsnp #6

Open yunyuy opened 9 years ago

yunyuy commented 9 years ago

Hi Treangen,

Thank you for the software. I am new to bioinformatics and programming, so I apologize if these are stupid questions. I had three issues when I ran Parsnp.

  1. The "determining repetitive regions..." step did not show up in my output, unlike in your tutorial examples and in other people's output that I found online.
  2. The .vcf file is missing from my output directory, although I can export it via Gingr.
  3. When I force Parsnp to include all the genomes in my data directory (13 FASTA files), I get a lot of unaligned regions, with 87% alignment coverage. I tried different -C values (1000, 10000, and 100000), but the coverage did not improve.

I am using Ubuntu Linux and linked Parsnp to ~/bin using: ln -fs ~/bifido/harvest/src/Parsnp-Linux64-v1.2/parsnp ~/bin. I don't know if the linking matters.

Thank you for your time and help! Let me know if there is any other information I should provide.

treangen commented 9 years ago

Hello,

I had three issues when I ran Parsnp.

ok, happy to clarify, please see below:

The "determining repetitive regions..." step did not show up in my output, unlike in your tutorial examples and in other people's output that I found online.

This is expected. As of the most recent release, this step is no longer enabled. Instead, you can import a BED-formatted file (see http://bedtools.readthedocs.org/en/latest/content/general-usage.html#bed-format) containing repetitive regions (or any other regions you'd like to filter) via:

$>harvesttools -i parsnp.ggr -b parsnp.bed -S out.snps -o parsnp.*filt*.ggr

This command imports the BED-formatted regions as a filter and writes a multi-FASTA file (out.snps) containing only the SNP columns that pass both the built-in and the imported filters.
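For reference, a BED file is tab-separated text with one region per line: sequence name, 0-based start, and end (exclusive). Here is a minimal Python sketch for writing one. The sequence name `ref_contig_1` and the coordinates are made-up placeholders, and I am assuming the name in the first column should match the sequence name in your reference, so please check that against your own data.

```python
# Sketch: build a minimal three-column BED file of regions to filter.
# The sequence name and coordinates used below are hypothetical placeholders.


def bed_lines(regions):
    """Return BED text for an iterable of (name, start, end) tuples.

    BED is tab-separated; start is 0-based and end is exclusive.
    Regions are sorted by name, then start, for readability.
    """
    lines = []
    for name, start, end in sorted(regions):
        if start < 0 or end <= start:
            raise ValueError(f"invalid interval: {name} {start} {end}")
        lines.append(f"{name}\t{start}\t{end}\n")
    return "".join(lines)


# Hypothetical repeat regions on a reference sequence named "ref_contig_1".
regions = [("ref_contig_1", 15000, 16200), ("ref_contig_1", 1000, 2500)]
bed_text = bed_lines(regions)

if __name__ == "__main__":
    # Write the file name used in the harvesttools command above.
    with open("parsnp.bed", "w") as handle:
        handle.write(bed_text)
```

The resulting parsnp.bed can then be passed to the `-b` option shown in the command above.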

The .vcf file is missing from my output directory, although I can export it via Gingr.

This is expected and, as you correctly note, the VCF can be exported via Gingr and/or readily extracted via:

$>harvesttools -i parsnp.*filt*.ggr -V out.vcf

When I force Parsnp to include all the genomes in my data directory (13 FASTA files), I get a lot of unaligned regions, with 87% alignment coverage.

Depending on multiple factors (including gene loss/gain, highly polymorphic regions, highly fragmented draft genome assemblies, highly repetitive regions in the genome, and current limitations in sensitivity), this is likely the expected result for your dataset when running Parsnp. It is a core genome alignment method, and very strictly so: locally collinear blocks must contain sequence from 100% of the genomes. If your goal is to maximize the alignment coverage of a multiple genome alignment, I'd recommend looking at existing multiple whole genome alignment tools, such as Mauve (http://darlinglab.org/mauve/mauve.html) or Mugsy (http://mugsy.sourceforge.net).
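To make the "100% of the genomes" point concrete, here is a toy Python illustration. It is not how Parsnp computes its alignment; the block lengths, the 10 kb reference, and the function name `core_coverage` are all hypothetical. It only shows that a block missing from even one genome contributes nothing to core coverage.

```python
# Toy illustration only: this is NOT Parsnp's algorithm.
# All block lengths and genome counts below are hypothetical.


def core_coverage(blocks, n_genomes, reference_length):
    """Fraction of the reference covered by blocks found in every genome.

    blocks: list of (block_length, set of genome indices containing it).
    A block counts toward the core only if all n_genomes contain it.
    """
    all_genomes = set(range(n_genomes))
    core_length = sum(
        length for length, present in blocks if present >= all_genomes
    )
    return core_length / reference_length


# Hypothetical 10 kb reference split into three blocks across 13 genomes.
blocks = [
    (6000, set(range(13))),  # present in all 13 genomes: counted
    (2700, set(range(13))),  # present in all 13 genomes: counted
    (1300, set(range(12))),  # missing from one genome: dropped from the core
]
coverage = core_coverage(blocks, n_genomes=13, reference_length=10000)
print(coverage)  # 0.87
```

In this made-up case, a single genome lacking the third block is enough to lower coverage to 87%, regardless of how well the other 12 genomes align there.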

I don't know if the linking matters.

This should be perfectly fine.

Thank you for your time and help!

Sure thing, let me know if you run into any further issues.

-Todd

yunyuy commented 9 years ago

Thank you very much! Your quick response is a great support to my work!

-Yun