sanger-pathogens / circlator

A tool to circularize genome assemblies
http://sanger-pathogens.github.io/circlator/
Other
232 stars 59 forks source link

Error running circlator 0.14.0 to reproduce results from published article #54

Closed sajvanderzeeuw closed 8 years ago

sajvanderzeeuw commented 8 years ago

Dear Author,

I am trying to reproduce the results published in your article. After installing circlator, and all the other dependencies i get a error:

 /exports/sasc/sajvanderzeeuw/circlator-0.14.0/build/scripts-3.4/circlator all hgap_assembly.fasta corrected_reads.fasta circlator_0.14                                                                                                              Traceback (most recent call last):
  File "/home/sajvanderzeeuw/.virtualenvs/circlator/lib/python3.4/site-packages/pyfastaq-3.11.1-py3.4.egg/pyfastaq/utils.py", line 21, in open_file_read
FileNotFoundError: [Errno 2] No such file or directory: '03.assemble/contigs.fastg'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/exports/sasc/sajvanderzeeuw/circlator-0.14.0/build/scripts-3.4/circlator", line 52, in <module>
    exec('circlator.tasks.' + task + '.run()')
  File "<string>", line 1, in <module>
  File "/home/sajvanderzeeuw/.virtualenvs/circlator/lib/python3.4/site-packages/circlator-0.14.0-py3.4.egg/circlator/tasks/all.py", line 138, in run
    m.run()
  File "/home/sajvanderzeeuw/.virtualenvs/circlator/lib/python3.4/site-packages/circlator-0.14.0-py3.4.egg/circlator/merge.py", line 681, in run
    self._circularise_contigs(nucmer_hits)
  File "/home/sajvanderzeeuw/.virtualenvs/circlator/lib/python3.4/site-packages/circlator-0.14.0-py3.4.egg/circlator/merge.py", line 318, in _circularise_contigs
    called_as_circular_by_spades = self._get_spades_circular_nodes(reassembly_fastg)
  File "/home/sajvanderzeeuw/.virtualenvs/circlator/lib/python3.4/site-packages/circlator-0.14.0-py3.4.egg/circlator/merge.py", line 609, in _get_spades_circular_nodes
    names = set([x.id.rstrip(';') for x in seq_reader if ':' in x.id])
  File "/home/sajvanderzeeuw/.virtualenvs/circlator/lib/python3.4/site-packages/circlator-0.14.0-py3.4.egg/circlator/merge.py", line 609, in <listcomp>
    names = set([x.id.rstrip(';') for x in seq_reader if ':' in x.id])
  File "/home/sajvanderzeeuw/.virtualenvs/circlator/lib/python3.4/site-packages/pyfastaq-3.11.1-py3.4.egg/pyfastaq/sequences.py", line 35, in file_reader
  File "/home/sajvanderzeeuw/.virtualenvs/circlator/lib/python3.4/site-packages/pyfastaq-3.11.1-py3.4.egg/pyfastaq/utils.py", line 23, in open_file_read
pyfastaq.utils.Error: Error opening for reading file '03.assemble/contigs.fastg'

This is off course a very obvious error because the tools expect a file which contains a wrong name. This is a known bug in your 0.14.0 release?

Also i tried 1 of the newest releases 1.1.3. But here i cant reproduce your plasmodium results. It gives me that contig 96 is not circular.

See the folowing diff:

diff 04.merge.circularise.log ../test_circular_with_dnaAfile/04.merge.circularise.log
1a2
> [merge circularised]  unitig_97|quiver        0       0       0       0
3d3
< [merge circularised]  unitig_97|quiver        0       0       1       1

When i check if all dependencies are there it says all test succeeded. This is the point i'm stuck at.

Basically the question i have is:

How can i reproduce the findings of your published article: http://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0849-0

And which options are nice to take into account for circularizing genomes?

Thanks in advance!

Best regards,

Sander

martinghunt commented 8 years ago

Hi Sander,

What version of spades are you using? The missing file error was a known issue - please see https://github.com/sanger-pathogens/circlator/issues/39 If you want to use Circlator version 0.14.0, then you'll need SPAdes version 3.6.0 or lower. For the paper we used 3.5.0.

Re the plasmodium data, we used the option --assemble_spades_k 101 becuase the reads were low coverage.

As for options, that probably depends on your dataset. We explored varying most of the options in the paper (see the section "Evaluation of user-defined parameters").

Best, Martin

sajvanderzeeuw commented 8 years ago

Hi @martinghunt,

Thanks for replying so fast. I've checked your test data from Plasmodium and managed to reproduce even with the newest version!

I have one other question if i may ask. What is in your terms low coverage? 10 till 50 X?

Thanks, Best,

Sander

martinghunt commented 8 years ago

Hi Sander,

Thanks for confirming that you reproduced the results. That's always good to hear :)

I just had a quick look at a few of the NCTC samples compared with the plasmodium. I just made rough histograms of the read coverage from running samtools depth -aa. Typical NCTC example is:

screen shot 2016-04-15 at 09 38 12

Bear in mind this is corrected read coverage (ie what Circlator takes), not uncorrected reads.

Martin

sajvanderzeeuw commented 8 years ago

Thanks Martin!

Sorry but i think i made a mistake in my comparison, i thought the 2 outputs of my run vs you test data was the same but i missed some. I run it with the --assemble_spades_k 101 . Circlator all hgap_assembly.fasta corrected_reads.fasta circlator_k_101

06.fixstart.log from testDATA:

[contig break finder] id        break_point     gene_name       gene_reversed   new_name        skipped
[contig break finder] unitig_96|quiver  16443   prodigal        no      -       -
[contig break finder] unitig_97|quiver  2906    prodigal        yes     -       -

06.fixstart.log from My run:

[contig break finder]   id      break_point     gene_name       gene_reversed   new_name        skipped
[contig break finder]   unitig_96|quiver        16443   prodigal        no      -       -
[contig break finder]   unitig_97|quiver        2906    prodigal        no      -       -

I could only find one difference in the other logs and that was the clean.log

05.clean.log YOURS

[clean] unitig_96|quiver        user_kept
[clean] unitig_97|quiver        user_kept

05.clean.log MINE

[contig cleanup] contig_id      clean_log
[contig cleanup] unitig_96|quiver skipped
[contig cleanup] unitig_97|quiver skipped

What can be the cause of this? Thanks in advance!

SPAdes version : 3.5.0 Mummer: 3.23 Circlator: 0.14.0 Prodigal: 2.60 BWA: 0.7.12 Samtools: 0.1.9

martinghunt commented 8 years ago

Apologies for the confusion - in actual fact Circlator 0.14.1 was used. I will need to send an erratum to get the manuscript corrected.

Using 0.14.1 will mean that the 05.clean.logs will be identical.

For the 06.fixstart.log, this part of circlator uses code from here: https://github.com/sanger-pathogens/bio_assembly_refinement In the paper we used 0.3.2. You've probably got 0.5.0 installed.

sajvanderzeeuw commented 8 years ago

No problem, i installed everything according to your story, but i still do not manage to reproduce the results.

These are the packages i now have installed.

pip freeze
bio-assembly-refinement==0.3.2
circlator==0.14.1
Cython==0.24
et-xmlfile==1.0.1
jdcal==1.2
nose==1.3.7
numpy==1.11.0
openpyxl==2.4.0a1
pbr==1.9.1
pyfastaq==3.5.0
pymummer==0.4.0
pysam==0.8.1
six==1.10.0
stevedore==1.12.0
virtualenv==15.0.1
virtualenv-clone==0.2.6
virtualenvwrapper==4.7.1

Here you can download the Plasmodium data i use, just to be sure that i have the same as what you distribute.

https://barmsijs.lumc.nl/szeeuw/Plasmodium.tar.gz

In here all runs i tried are stored.

The circlator dir is your run circlator_0.14.1.fixed is the latest run i did.

Yours 06.fixstart

[contig break finder] id    break_point gene_name   gene_reversed   new_name    skipped
[contig break finder] unitig_96|quiver  16443   prodigal    no  -   -
[contig break finder] unitig_97|quiver  2906    prodigal    yes -   -

My 06.fixstart

[contig break finder] id    break_point gene_name   gene_reversed   new_name    skipped
[contig break finder] unitig_96|quiver  16443   prodigal    no  -   -
[contig break finder] unitig_97|quiver  16227   prodigal    no  -   -

Again any help will be strongly appreciated.

martinghunt commented 8 years ago

That pins down the remaining difference to fixstart, which is essentially bio_assembly_improvement. The contigs obviously have no match to a dnaa gene, so prodigal gets run to pick a gene near the centre of each contig and use that as a break point.

I'm wondering if that part of the code is deterministic. @nds can you comment please?

sajvanderzeeuw commented 8 years ago

@martinghunt Which version of bio_assembly_improvement i should have installed, to get this to work? Or is this related to the version of Prodigal. Hope you can still provide me an answer on this. Thanks in advance!

martinghunt commented 8 years ago

Hi,

Sorry for the slow reply, took some digging to figure out what's going on. It's actually earlier on at the merge stage where things are different, not bio_assembly_refinement. Depending on code versions, unitig_97 doesn't always get circularized succesfully. It should be ~6kb long and then the prodigal gene is at 2906. When the circularization goes wrong, it ends up around 30kb long and then the prodigal gene is at 16227.

I just ran circlator 0.14.1 with spades 3.5.0 on my original data, and on the data in your tarball and I get the same results as the paper. The important file here is the circularize log:

$ cat 04.merge.circularise.log [merge circularised] #Contig repetitive_deleted circl_using_nucmer circl_using_spades circularised [merge circularised] unitig_96|quiver 0 1 0 1 [merge circularised] unitig_97|quiver 0 0 1 1

In a failed version, unitig_97 doesn't get circularized using the circular call from spades: [merge circularised] #Contig repetitive_deleted circl_using_nucmer circl_using_spades circularised [merge circularised] unitig_96|quiver 0 1 0 1 [merge circularised] unitig_97|quiver 0 1 0 1

The only differences I see python module in versions is: pysam 0.8.3 pymummer 0.7.1 pyfastaq 3.12.1 which shouldn't make any difference. The only idea I have is that your circlator script is picking up a different version of the circlator module.