ndierckx / NOVOPlasty

NOVOPlasty - The organelle assembler and heteroplasmy caller

Input reads have incorrect file format #206

Open cebos opened 1 year ago

cebos commented 1 year ago

Hi Nicolas, I'm using NOVOPlasty to de novo assemble mitochondrial genomes for a dataset of 29 sets of paired-end short-read data. For over a third of my samples, I've gotten the following error:

THE INPUT READS HAVE AN INCORRECT FILE FORMAT! PLEASE SEND ME THE ID STRUCTURE!

I've attached an example of some of the reads from one sample below; please let me know what other information you require, if any. I filtered the raw data for the entire dataset with fastp using default parameters and am giving NOVOPlasty the filtered forward and reverse read files generated by fastp. Thank you for your time, your help and advice is greatly appreciated! Best, cebos

Example: zless Microrhombophryne_Ca39_ZCMV-12404_L001_R1.out.fastq.gz

@J00138:141:HN23TBBXX:5:1101:23520:1068 2:N:0:ATAGCGAC+ATTACTCG
CCCTGAATGTCTACGTGGCTCTTTGTTACTATAAACTTGATTACTATGATGTGTCACAGGAAGTTCTTGCAGTATATTTGCAACAGGTTCCTGACAGTACGATTGCTCTTAATTTGAAGGCCTGCAATCATTTTCGTCTTTACAATGGGAA
+
AAFFFJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJFJJ--AFJFJJ7JF-F-FAJFFAFFJF7AJFJJFJ7-AFFJJJJJJJJAJJFJJJJJJJJJJJJJFFJJJJFJJJJJJJJJJFJJJJJJJJJJAJFAAAJJFJJJ7AJJJJJ<7
@J00138:141:HN23TBBXX:5:1101:30076:1068 2:N:0:ATAGCGAG+ATTACTCG
GCCAACAAAAGGTATCGCCTTATTTCTTCACTTTTCTATTGAATTCAATGGCCAACACGGTACAACACATCACTGCTACATATCGAATAGATAGCTTGGCCGTAGGCCTGTGTGTTTGGGGAAGGGCTGATCAGAACCCATCGGGATAGCT
+
A<AFFJJJJJJJJJJJJFJJJJ-F7FFJJJJJF7J7FJF-<FJ<F7JJFA--7--AFF7-AFJJFFJ<7JFJJF7FFJJJFJA<FJJFFFJJJJJJJFJFJ7JF<FFFAFJ--AFJAFFFJJJJJJFJAF-F7FFJJJA7-))7-FFF<F<
@J00138:141:HN23TBBXX:5:1101:18873:1103 2:N:0:ATAGCGAC+ATTACTCG
TGGATACTGGAGAAGATTCGAGTGGTAGATTCTATTCAGAACCTTGGAGATGATCTCACTGCAGTCATGTCAATTCAGAGAAAACTCTGTGGCATTGAGAAAGATCTTGGTGCCATTGAGTCTAAACTTGTAAGTCTACAAGAAGAGGCAA
+
AAAFFJJJJJJFFJJJJJJJJJJJJJFJFFJJJJJJJJJJJFJJJJJJFJFJJJJAAJFFJJAJ7FJJJJJJJJJJJJJJJJJJJJJJJFJJJJFJJ-77AJJFJJFFJJJJJJJJJJJJFJJJFFJJFF<JJJJFF<JJJJJJJJJJJJJ
@J00138:141:HN23TBBXX:5:1101:27965:1103 2:N:0:ATAGCGAG+ATTACTCG
AGGTTGGCAATGTGGAATCAGGCAGAGTGTGCAATGGCAAGCAAGGTT
+
AAFFFJJJJJJJAFJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJFAF
@J00138:141:HN23TBBXX:5:1101:25276:1121 2:N:0:ATAGCGAC+ATTACTCG

cebos commented 1 year ago

I also have an unrelated NOVOPlasty question: what is the function of the optional config.txt parameters Insert Range = 1.9 and Insert Range strict = 1.3?

ndierckx commented 1 year ago

Hi, Although these are the forward reads, the ID contains a 2: "2:N:0:". Shouldn't these be the reverse reads?

And why are you filtering the reads? You can just use the complete dataset.

Greets,

Nicolas
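The check described above can be scripted before starting an assembly. The following is a minimal sketch, assuming Illumina Casava 1.8+ style headers like the ones in the example (a space followed by `read:filter:control:index`). The function names `read_number` and `count_read_numbers` are hypothetical helpers, not part of NOVOPlasty.

```python
import gzip
import re
from collections import Counter

# Matches the comment part of a Casava 1.8+ header, e.g. "2:N:0:ATAGCGAC+ATTACTCG".
HEADER_COMMENT = re.compile(r"^([12]):[YN]:\d+:")


def read_number(header):
    """Return 1 or 2 from a FASTQ header line, or None if it cannot be found."""
    parts = header.strip().split(None, 1)
    if len(parts) < 2:
        return None
    match = HEADER_COMMENT.match(parts[1])
    return int(match.group(1)) if match else None


def count_read_numbers(path, max_records=1000):
    """Count read numbers in the first max_records records of a FASTQ file.

    Assumes strict four-line FASTQ records (hypothetical helper).
    """
    opener = gzip.open if path.endswith(".gz") else open
    counts = Counter()
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i // 4 >= max_records:
                break
            if i % 4 == 0:  # header line of each record
                counts[read_number(line)] += 1
    return counts
```

A file meant to hold forward reads should report only the key `1`; a count under `2` suggests the R1 and R2 files were swapped or mislabeled.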

ndierckx commented 1 year ago

The insert range doesn't need to be changed. You can make it larger if your library has highly variable insert sizes, but that is almost never the case.

cebos commented 1 year ago

Hi Nicolas, Thanks for your prompt response! You were right, it appears some of the forward and reverse read files were mislabeled as each other; they are working fine now. What's the difference between the default settings of

Optional:
Insert size auto = yes
Use Quality Scores =
Output path =

versus adding Insert Range = 1.9 and Insert Range strict = 1.3?

I also have a question about the Store Hash = Yes option. I'm testing 1000s of seeds on my dataset to see if I can find optimal ones; if I enable this option, should that reduce the run time? My understanding is that the hash table is created based on the read information, and the seed is applied after. However, once I added the Store Hash = Yes option to my script, NOVOPlasty appears to take as long or longer to run my analyses than it did beforehand. The slurm output for the run appears to show that a new hash table is being stored for each run, and the output directory also has a new Hash file for each project run.

Reading Input......OK
Scan reference sequence......OK
Building Hash Table......OK
Subsampled fraction: 100.00 %
Retrieve Seed...BUILD2

I've written a script to create a batch file for each individual that provides a new project name for the seed + sample combination, and the other standard information, so that this structure is iterated through the file (until all seed combinations have been included):

Project_${sample}_${seed}
${seed_dir}${seed}.fasta
${input_dir}${sample}_R1.fq.gz
${input_dir}${sample}_R2.fq.gz

From the above, it appears that the Store Hash option stores a separate hash for each new project, even if the read data being provided is the same. Is there a way I can store the hash table to use across many projects? I want to compare contig lengths across seeds, so it's important to keep the project naming conventions since that's how the output fasta files are named. Thanks a lot for your help!!
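The batch-file generation described above can be sketched as follows. This is an illustration only: it assumes the four-line block per project shown in the post (project name, seed file, forward reads, reverse reads), and it assumes blocks are separated by a blank line, which should be checked against the Batch function wiki page. `batch_entries` and `render_batch` are hypothetical names.

```python
from itertools import product


def batch_entries(samples, seeds, seed_dir, input_dir):
    """Yield one four-line block per sample/seed combination."""
    for sample, seed in product(samples, seeds):
        yield [
            f"Project_{sample}_{seed}",       # project name, used in output file names
            f"{seed_dir}{seed}.fasta",        # seed sequence
            f"{input_dir}{sample}_R1.fq.gz",  # forward reads
            f"{input_dir}{sample}_R2.fq.gz",  # reverse reads
        ]


def render_batch(samples, seeds, seed_dir, input_dir, separator="\n\n"):
    """Join all blocks into batch-file text.

    The blank-line separator is an assumption, not a confirmed format.
    """
    blocks = ["\n".join(entry)
              for entry in batch_entries(samples, seeds, seed_dir, input_dir)]
    return separator.join(blocks) + "\n"
```

For example, `render_batch(["Ca39"], ["seedA", "seedB"], "seeds/", "reads/")` produces two blocks that share the same read files and differ only in project name and seed.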

deyuanyang commented 1 year ago

Hi,

I also have the same question. If I changed the name of the projects, the seeds would not work.

ndierckx commented 1 year ago

@cebos

Insert size auto means that it will automatically calculate the insert size; the range determines how much it can differ from the insert size. No need to change anything there, it won't change much.

About the store hash, have you read the wiki: https://github.com/ndierckx/NOVOPlasty/wiki/Store-hashes-locally

You need to run store hash only once, and then you need to use the stored hashes instead of the reads. It will speed up the first phase by a lot, especially for larger datasets.

Why are you using 1000s of seeds? If you have a WGS dataset, one seed should be enough; the seed is only needed to initiate the assembly and should be quite flexible.

ndierckx commented 1 year ago

@deyuanyang

Not sure what you mean by the seeds won't work...

ndierckx commented 1 year ago

There is also a batch function: you can check the wiki

https://github.com/ndierckx/NOVOPlasty/wiki/Batch-function

It is easy to use, and it lets you run many samples with the changes you want per run.

cebos commented 1 year ago

@ndierckx Thanks for sharing the specific info on the store hash function. So, if I understand correctly, there is no way to generate and store the hash within the same run (since the config file must first have Store Hash = yes and then Store Hash = path/to/hash/file)? If possible, I would like to be able to generate the hash and then immediately call it for subsequent project runs within the same batch file, like so (incorporating a 5th line for the HASH_project.txt file):

Project_${sample}_${seed}
${seed_dir}${seed}.fasta
${input_dir}${sample}_R1.fq.gz
${input_dir}${sample}_R2.fq.gz

Project_${sample}_${seed}
${seed_dir}${seed}.fasta
${output_dir}HASH2B_Project_${sample}_${seed}.txt
${output_dir}HASH2C_Project_${sample}_${seed}.txt
${output_dir}HASH_Project_${sample}_${seed}.txt

I assume this isn't possible, since two separate arguments are required in the config file to first generate the hash and then call it later. However, I can still use the batch function to call saved hashes, correct?

ndierckx commented 1 year ago

You can use the hash files directly for subsequent runs, because you will know what the hash file is named. You can use the bash mode for it too...