Subsampled fraction: 100.00 % & INVALID SEED

ndierckx / NOVOPlasty

NOVOPlasty - The organelle assembler and heteroplasmy caller


Subsampled fraction: 100.00 % & INVALID SEED #146

Closed: OmonkeyGOD closed this issue 3 years ago

OmonkeyGOD commented 3 years ago

Hi Nicolas,

I am using the newest version, 4.2, for the assembly. After trying several different seeds, I keep getting this message, even when I use reads from the fastq file as the seed. When I lowered the k-mer to 23, I got a contig with a length of 280. Does this mean my data was trimmed, and should I use the untreated data instead? My data are paired-end fastq files, about 3 Gb each for the forward and reverse files. The reference genome is around 12 Mb. This is the file headline:

Please advise. Thank you very much!

ndierckx commented 3 years ago

Hi,

What are you trying to assemble, and do you have enough coverage? Maybe also add the log file.

OmonkeyGOD commented 3 years ago

Hi~

I use a mating locus gene as the seed because I want to check the mating type. I did this before with a different species using NOVOPlasty (NP), and it gave me good results.

This time, with the default parameters, I could not retrieve any reads, so I lowered the k-mer to 23. However, the assembled contig matches Oryza sativa Indica, which is not the expected species; my sample should be a fungus. I have attached the log file for reference.

The quality of the sequence is terrible. The read length is 150 bp. The genome size is about 12 Mb. For both forward and reverse sequencing, there are 8740355 reads. However, only around 15% of the reads can be mapped to the reference genome using BWA mem.
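As a rough sanity check on the numbers above, the expected mean depth can be estimated from read count, read length and genome size. This is only a back-of-the-envelope sketch: it assumes that 8740355 is the number of reads in each of the two files, and the function name `expected_depth` is made up for illustration.

```python
def expected_depth(read_pairs, read_length, genome_size, mapped_fraction=1.0):
    """Estimate mean per-base depth for paired-end data.

    read_pairs      -- number of reads in ONE file (assumed equal in both)
    read_length     -- read length in bp
    genome_size     -- genome size in bp
    mapped_fraction -- share of reads assumed to belong to the target genome
    """
    total_bases = read_pairs * 2 * read_length
    return total_bases * mapped_fraction / genome_size


# Values quoted in this thread: 150 bp reads, ~12 Mb genome, ~15% mapped.
raw = expected_depth(8_740_355, 150, 12_000_000)
mapped = expected_depth(8_740_355, 150, 12_000_000, mapped_fraction=0.15)
print(f"raw depth ~{raw:.0f}x, depth from mapped reads ~{mapped:.0f}x")
```

If all mapped reads really came from the target genome, this would still be a usable depth, so a low mapping rate on its own does not explain a failed seed extension; how the mapped reads are distributed along the genome matters more.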

I also tried to assemble the whole genome with SPAdes and BLASTed the assembled contigs against the NCBI 16S rRNA database. The hits come from many different species. I am so confused now. Could it be sample contamination?

log_An_3.txt

ndierckx commented 3 years ago

I think you have a problem with your sample. If you often get the INVALID SEED error, it usually means there are almost no similar sequences in the dataset. Have you also checked your mapping? A 15% mapping rate doesn't say much by itself, because reads from other species will align too. If you check the alignment distribution and there are a lot of gaps, I don't think your dataset contains any data from your species.

OmonkeyGOD commented 3 years ago

I checked the depth of the BAM file. Yes, there are a lot of gaps several hundred bp long. Only about one-fourth of the genome has reads mapped, and within those regions the coverage varies widely. There is likely no information in the data. Thank you very much for your advice.
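For anyone repeating this check, here is a minimal sketch of how the covered fraction and the gaps could be summarised. It assumes per-base depth in the three-column, tab-separated form (reference, position, depth) that `samtools depth -a` writes, with zero-depth positions included; the function name `coverage_summary` and the toy data are hypothetical, not output from this dataset.

```python
def coverage_summary(depth_lines, min_depth=1):
    """Return (breadth, gaps) from 'ref<TAB>pos<TAB>depth' lines.

    breadth -- fraction of listed positions with depth >= min_depth
    gaps    -- list of (ref, start, end) runs below min_depth (inclusive)
    """
    covered = total = 0
    gaps = []
    gap_start = None  # (ref, pos) where the currently open gap began
    prev = None       # (ref, pos) of the previous line
    for line in depth_lines:
        line = line.strip()
        if not line:
            continue
        ref, pos, depth = line.split("\t")
        pos, depth = int(pos), int(depth)
        total += 1
        is_covered = depth >= min_depth
        covered += is_covered
        # Close an open gap when coverage resumes or the reference changes.
        if gap_start is not None and (is_covered or ref != gap_start[0]):
            gaps.append((gap_start[0], gap_start[1], prev[1]))
            gap_start = None
        if not is_covered and gap_start is None:
            gap_start = (ref, pos)
        prev = (ref, pos)
    if gap_start is not None:
        gaps.append((gap_start[0], gap_start[1], prev[1]))
    breadth = covered / total if total else 0.0
    return breadth, gaps


# Toy input standing in for real per-base depth output.
demo = ["chr1\t1\t5", "chr1\t2\t0", "chr1\t3\t0", "chr1\t4\t7", "chr2\t1\t0"]
breadth, gaps = coverage_summary(demo)
print(f"breadth = {breadth:.2f}, gaps = {gaps}")
```

A breadth near 0.25 with many short gaps, as described above, is the pattern expected when only conserved regions attract reads from other organisms.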

ndierckx commented 3 years ago

Seems you don't have your fungus in it then... Greets