Input data metrics explanation, % of mtDNA reads of the total sequence reads that mapped to the whole mtDNA_follow up

AliBasuony2022 commented 5 months ago

Dear Nicolas,

Just a follow up question for the issue #216

How do I know the right % of mtDNA reads of the total sequence reads that mapped to the whole mtDNA? I'm doing a comparison between the performance of NOVOPlasty and other de novo assemblies and this information is so important.

When I used adifferent memmory settings (all other settings are fixed), I have got the same lenght of the largest contig, but with differnt number for assembled reads, aligned and total reads.

Does the subsampled fraction: 99.99 % when setting the Max memory= Null is right? if so, the number of total reads is over the number of reads in the raw data. I'm still confused, sorry.

max memory Null log_mito_1_375_12_6_max memory Null.txt

max memory 100 log_mito_1_375_12_3_max memory 100.txt

memory 64 log_mito_1_375_12_max memory 64.txt

Kind regards, Ali

ndierckx commented 5 months ago

Not sure what your question is, the 99,99% is the amount of reads used, without the max memory it is usually 100%, but sometimes the reads are too short or have missing read pairs and then get discarded. So no subsampling was done in this case

Forward reads without pair: 41965 Reverse reads without pair: 17442

This means some reads had not matching pair...

How many reads are there in total, less than than 434743096?

AliBasuony2022 commented 5 months ago

Hi Nicolas,

raw data: 218,591,573 read pairs (adapter-free) trimmed data: 216,237,628 read pairs (adapter and low qulaity reads free).

Both raw and trimmed data produced the same largest contig,16521 bp, but % of organelle genome is 0.43 for raw and 0.44 for trimmed data. see below log files

log_mito_1_375_raw data.txt log_mito_1_375_trimmed data.txt

Is it expected that % of recovered organelle genome in case if NOVOPlasty overcomes that from reference-based ones (% 0.39)?

Below is the pipeline for reference-based approach:

Approaches 3 and 4: Reference-based read mapping was performed using two different parameter settings (see below), by aligning the trimmed sequencing reads against the V. vulpes reference genome (assembly version: GCF_003160815.1_VulVul2.2; Kukekova et al., 2018) using BWA-MEM v0.7.17 (Li and Durbin, 2009) with default parameters. We then used SAMtools v1.10 (Li et al., 2009) to obtain sorted bam files, followed by using GATK v4.2.2.0 (https://gatk.broadinstitute.org/hc/en-us) to remove PCR duplicates using MARKDUPLICATESSPARK and to filter out bad read mates, reads with mapping quality zero and reads which mapped ambiguously (Nater et al., 2017), and again SAMtools to extract the mitochondrial reads that mapped to the mtDNA scaffold (NC_008434, Arnason et al. 2006) of the reference genome. HAPLOTYPECALLER in GATK was used to call variants, using two different parameter settings, using as values for the flag --sample-ploidy: 1 for haploid (ploidy 1; Approach 3), and 2 for diploid (ploidy 2; Approach 4), each yielding a separate VCF file. Finally, FastaAlternateReferenceMaker from GATK was used to convert the two VCF files from Approaches 3 and 4 to FASTA format.

Kind regards, Ali

ndierckx commented 5 months ago

Do you have a problem with the 0.01 difference? :) If you subsample there will be some fluctuations and it is just an estimation. If your other method gave 0.39%, then I have to say the estimation is pretty good, what does it matter if it is 0.39 or 0.43 ...

AliBasuony2022 commented 5 months ago

No, I don't have a problem with this difference- I'm joust surperising how de novo (NOVOPlasty) produce more % of mtDNA reads than reference-based approach.

I'm doing a comparison between the different assemblies, and I should give some explecit recommondation for the performence of each method, hene any diffrence even if it's samll is importatnt.

I think, we must leave max memory empty, thus, it will not subsample and then use all the reads, is this right?

I apprecite your time and help.

Regards, Ali

ndierckx commented 5 months ago

It is a de novo assembler, I didn't put much effort in those statistics, I just added them because I thought they could be informative for the users, but it is still an estimation. Does the mt% have to be that accurate? Max memory setting is to reduce memory usage and increase the speed, the assembly will still be as good, because you have very high coverage. Maybe %mt will be a bit less accurate, but I thought nobody would care about that.. And in your case you have some small contigs, this could explain the elevated %, because the reads used for assembling those are also included and maybe they are not mitochondrial. Although it is more probable that the assembly is incomplete because of some repetitive region in the control region, in those cases you often see those small contigs. Other graph based assemblers will collapse that region, which seems that the assembly is complete but it isn't. Hence, a lot of mt genomes on NCBI are not fully assembled (collapsed repetitive regions)... You can send me your log files and contigs, I can have a look if this is the case

AliBasuony2022 commented 5 months ago

Thanks for this explanation, below are the log files

But, you didn't response to this question, yet. " I think, we must leave max memory empty, thus, it will not subsample and then use all the reads, is this right?"

log_mito_1_375_12_3_max memory 100.txt log_mito_1_375_12_max memory 64.txt log_mito_1_375_12_6_max memory Null.txt

ndierckx commented 5 months ago

If max memory is empty it doesn't subsample, but not sure what you mean by we must.., you can do whatever you prefer

And I need the extended log to see, but since you didn't use that option you can send me the Contigs fiel, probaply can see it from there

AliBasuony2022 commented 5 months ago

Ok, I meant by "must." to get use of all the reads, then the estimation of recovered mtDNA reads will be an accuarte than when you subsample.

Below is the extended log files

log_extended_mito_1_375_12_3_max memory 100.txt log_extended_mito_1_375_12_max memory 64.txt

log_extended_mito_1_375_12_6_max memory Null.txt

ndierckx / NOVOPlasty

Input data metrics explanation, % of mtDNA reads of the total sequence reads that mapped to the whole mtDNA_follow up #222