Different assembly analysis sharing the same genome length

dtomas1989 commented 4 years ago

Dear all,

I am trying to assemble several varieties of the same species, but the assemblies obtained have the same genome size for every variety. Each assembly has a different nucleotide sequence (depending on the variety used as input), yet they all share exactly the same length. I find this very strange ...

Does anyone know why? I can't understand it ... I am using the following configuration file, changing only the Forward & Reverse reads in each analysis:

Project name = Novo
Type = chloro
Genome Range = 140768-160768
K-mer = 39
Max memory = 16
Extended log =
Save assembled reads = yes
Seed Input = cp_reference.fa
Extend seed directly = yes
Reference sequence = cp_reference.fa
Variance detection =
Chloroplast sequence =

Dataset 1:

Read Length = 150
Insert size = 312
Platform = illumina
Single/Paired = PE
Combined reads =
Forward reads = reads_1.tr.fq.gz
Reverse reads = reads_2.tr.fq.gz

Optional:

Insert size auto = yes
Insert Range = 1.9
Insert Range strict = 1.3
Use Quality Scores = no

Should I change or add any parameters?

Thank you so much in advance

ndierckx commented 4 years ago

What is the length? Could you show one of the logs?

ndierckx commented 4 years ago

Oh, I see. It's because you use the "Extend seed directly" option with the complete reference genome as the seed. You can't do that: the seed should be short, and "Extend seed directly" should only be used when the seed is a sequence already assembled from the same dataset. Just try the seed that I provided and set "Extend seed directly" to no.
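As an illustration of the point above, here is a minimal Python sketch that reads a "Key = value" configuration like the one posted and flags the combination described in this thread ("Extend seed directly = yes" with the same file given as seed and as reference). The function names `parse_config` and `seed_warnings` are hypothetical helpers written for this example; they are not part of NOVOPlasty, and the check covers only this one case.

```python
def parse_config(text):
    """Parse 'Key = value' lines into a dict (keys lower-cased, empty values kept as '')."""
    settings = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank lines and section headers such as 'Dataset 1:'
        key, _, value = line.partition("=")
        settings[key.strip().lower()] = value.strip()
    return settings


def seed_warnings(settings):
    """Return a list of warnings for the seed setup discussed in this thread.

    Hypothetical check, not NOVOPlasty behaviour: it only compares the
    'Seed Input' and 'Reference sequence' paths as plain strings.
    """
    warnings = []
    extend = settings.get("extend seed directly", "").lower() == "yes"
    seed = settings.get("seed input", "")
    reference = settings.get("reference sequence", "")
    if extend and seed and seed == reference:
        warnings.append(
            "Extend seed directly = yes with the reference genome as the seed: "
            "use a short seed and set Extend seed directly = no."
        )
    return warnings
```

Calling `seed_warnings(parse_config(text))` on the configuration from the first post would return one warning; with a short seed file and "Extend seed directly = no" it returns an empty list.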

dtomas1989 commented 4 years ago

Thank you very much for your quick answer. I understand it much better now and I'm going to try that. Thank you so much again.

dtomas1989 commented 4 years ago

Hi @ndierckx ,

I've run a new analysis using the seed that you provided and "Extend seed directly = no" (as you told me), but the results are very bad: my assembly is extremely small now. Could you give me any advice? Should I change any additional parameters? My log file is the following:

NOVOPlasty: The Organelle Assembler
Version 3.8.1
Author: Nicolas Dierckxsens, (c) 2015-2019

Input parameters from the configuration file: Verify if everything is correct

Project:

Project name = Novo
Type = chloro
Genome range = 140768-160768
K-mer = 39
Max memory = 16
Extended log =
Save assembled reads = yes
Seed Input = /software/NOVOPlasty/Seed_RUBP_cp.fasta
Extend seed directly = no
Reference sequence = /genomes/my_cp_reference.fa
Variance detection =
Chloroplast sequence =

Dataset 1:

Read Length = 150
Insert size = 312
Platform = illumina
Single/Paired = PE
Combined reads =
Forward reads = reads_1.tr.fq.gz
Reverse reads = reads_2.tr.fq.gz

Heteroplasmy:

Heteroplasmy =
HP exclude list =
PCR-free =

Optional:

Insert size auto = yes
Insert range = 1.9
Insert range strict = 1.3
Use Quality Scores =

Reading Input......OK

Scan reference sequence......OK

Building Hash Table......OK

Subsampled fraction: 5.87 %
Forward reads without pair: 2137559
Reverse reads without pair: 1069181

Retrieve Seed......OK

Initial read retrieved successfully: GTTGTTGTAAGAATTCTTAATTCATGAGTTGTAGGGAGGGATTTATGTCACCACAAACAGAGACTAAAGCAGGTGTTGGATTCAAAGCGGGTGTTAAAGAGTACAAATTGACTTATTATACTCCTGAATACGAAACCAAAGATACTGATA

Start Assembly...

------------Assembly 1 finished: Contigs are automatically merged in Merged_contigs file------------

Total contigs : 0
Largest contig : 152 bp
Smallest contig : 152 bp
Average insert size : 312 bp

-----------------------------------------Input data metrics-----------------------------------------

Total reads : 28478388
Aligned reads : 1474
Assembled reads : 70

Thank you for using NOVOPlasty!

Thank you very much in advance, Best regards

ndierckx commented 4 years ago

Subsampled fraction: 5.87 %
Forward reads without pair: 2137559
Reverse reads without pair: 1069181

It seems something is wrong with the read ID structure.

Can you send me the first 10 lines or so from your forward and reverse read files? You can do it with the head command.

dtomas1989 commented 4 years ago

Yes, of course, find them below and thank you so much in advance.

zcat reads_1.tr.fq.gz | head

@A00564:174:HNLH5DSXX:3:1101:1163:1000 1:N:0:CGAACTTA+CATACCAA
TGAATCCTCTCCTTGGGGCGGATTTGACATTTTGACCAGTTTCAACAGCATTTTCATCTCA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00564:174:HNLH5DSXX:3:1101:1235:1000 1:N:0:CGAACTTA+CATACCAA
CAAAACCTACTTGTTTTTCCAGATAGTAATTGAGTCTTAGTAGAGGATAGACACTTTGTGTTTGCCTTTCCCGTGTTCGATATCCGGTACTAACCTTTAGCTATACTATATATACTCTGTATACTTGCAGGTTTATTTAGTGCTAATAAA
+
FFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF:FFFFFFFFFFFFFFFFFFFF:FF,FFFFFFFFFFFFFF
@A00564:174:HNLH5DSXX:3:1101:1289:1000 1:N:0:CGAACTTA+CATACCAA
CAAAACCTACTTGTTTTTCCAGATAGTAATTGAGTCTTAGTAGAGGATAGACACTTTGTGTTTGCCTTTCCCGTGTTCGATATCCGGTACTAACCTTTAGCTATACTATATATACTCTGTATACTTGCAGGTTTATTTAGTGCTAATAAA

zcat reads_2.tr.fq.gz | head

@A00564:174:HNLH5DSXX:3:1101:1163:1000 2:N:0:CGAACTTA+CATACCAA
ATCAAAGAACTTTTTCATTCAGCTCATAATCAATTT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FF
@A00564:174:HNLH5DSXX:3:1101:1235:1000 2:N:0:CGAACTTA+CATACCAA
CCTTTTTCACTTCTTATTTTCCTATGAGTTCTTCAAGTTCCAAATCGAGCGGTTCTAGGCTGTCCTTCCCAGAGCGCGTTCGCATACAAAACCTGCAAGAACCAAAAATAAAAACGAATTAAAAACTAAAATTAAAAAATGAAAGCAATT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFF:F:FFFFF:,F,,FFFFFFFFFFFFFFFFFFF:F::FFFFFFFF:FFFFFFFF:F,FF:FFFFFFFFF:FFFFF
@A00564:174:HNLH5DSXX:3:1101:1289:1000 2:N:0:CGAACTTA+CATACCAA
CCTTTTTCACTTCTTATTTTCCTATGAGTTCTTCAAGTTCCAAATCGAGCGGTTCTAGGCTGTCCTTCCCAGAGCGCGTTCGCATACAAAACCTGCAAGAACCAAAAATAAAAACGAATTAAAAACTAAAATTAAAAAATGAAAGCAATT

ndierckx commented 4 years ago

Why are some reads so short? Have you done quality trimming?
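To see how widespread the short reads are (the first record in each file above is much shorter than the configured Read Length = 150), one could tabulate sequence lengths. This is a minimal sketch, assuming a well-formed FASTQ with exactly four lines per record; `read_lengths` is a hypothetical helper, not a NOVOPlasty function.

```python
from collections import Counter


def read_lengths(fastq_lines):
    """Count sequence lengths from an iterable of FASTQ lines.

    Assumes four lines per record (header, sequence, '+', qualities),
    so the sequence is always the second line of each group of four.
    """
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:
            counts[len(line.strip())] += 1
    return counts
```

For a gzipped file, the line source can be `gzip.open("reads_1.tr.fq.gz", "rt")` from Python's standard `gzip` module; `read_lengths(...).most_common(5)` then lists the five most frequent lengths.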

dtomas1989 commented 4 years ago

Yes, I performed a trimming step in this case, but the result looks similar with the untrimmed (raw) reads.

ndierckx commented 4 years ago

Hi, I tried those reads. It should have no problem recognizing them, but there seems to be a problem somewhere.

Subsampled fraction: 5.87 %
Forward reads without pair: 2137559
Reverse reads without pair: 1069181

This means you only used about 6 % of your dataset, I guess because it is very large. But the last two values should be 0: a lot of reads could not be linked to their paired read, which is weird. I don't have your data and don't know anything about it, so it is hard to say more.

And never quality trim; it just removes a lot of data and gains you nothing.
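An independent way to check the pairing is to compare read IDs between the two files. The sketch below is not how NOVOPlasty links pairs; it is a minimal Python illustration under two stated assumptions: the mate number sits either in a comment after the first space (as in the headers posted above, "... 1:N:0:...") or in a trailing "/1" or "/2", and each record has exactly four lines. `read_id` and `unpaired_counts` are hypothetical names.

```python
import re


def read_id(header):
    """Reduce a FASTQ header to an ID shared by both mates.

    Assumption: the mate number is in the comment after the first
    whitespace, or in a trailing '/1' or '/2' on the read name.
    """
    name = header.strip().lstrip("@").split()[0]
    return re.sub(r"/[12]$", "", name)


def unpaired_counts(forward_lines, reverse_lines):
    """Return (forward reads without pair, reverse reads without pair)."""
    # Header lines are every fourth line, starting with the first.
    fwd = {read_id(l) for i, l in enumerate(forward_lines) if i % 4 == 0 and l.strip()}
    rev = {read_id(l) for i, l in enumerate(reverse_lines) if i % 4 == 0 and l.strip()}
    return len(fwd - rev), len(rev - fwd)
```

Both ID sets are held in memory, so this is meant for a sample of the data rather than the full files. If both counts are 0 on such a sample, the IDs themselves match and the problem is more likely in how they are parsed.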

ndierckx commented 4 years ago

Hi,

Maybe you had a problem with the read IDs. I just uploaded version 4.0, in which I fixed a similar problem for another user, so maybe it works for you now too.
