shahab-sarmashghi / RESPECT

Estimating repeat spectra and genome length from low-coverage genome skims

High GS estimates #18

Open Banthandor opened 7 months ago

Banthandor commented 7 months ago

Dear Shahab,

First of all, many thanks for making this tool available. I really want it to work well because our data should fit the requirements for this method.

I have tried it on my own data after subsampling, but the estimates seem quite high, and there is quite a lot of variation: it ranged from <700 Mb to 1100 Mb. I know the estimate will start approaching the diploid genome size, so I initially thought that high heterozygosity in this species might explain the variation. I then tried to replicate a sample from the paper, yet the estimate is still substantially higher than the value reported there.

My output:

sample  input_type  sequence_type   coverage    genome_length   uniqueness_ratio    HCRM    sequencing_error_rate   average_read_length
SRR847379.unclassified.downsampled.1x.fastq.gz  sequence    genome-skim 1.32    604557431   0.39    652.84  0.0028  100.9895
SRR847379.unclassified.downsampled.2x.fastq.gz  sequence    genome-skim 2.60    613898358   0.38    646.75  0.0025  100.9891

Respect paper (supplementary data):

species accession duplication contamination Respect_error covest_error
Ceratitis capitata Mediterranean fruit fly SRR847379 436490799 10% 7% 17% -3% -40%

I preprocessed the data differently, but I do not see anything inherently wrong with my approach, so I don't understand why the estimate is so different. Below are the preprocessing steps I applied:

fastp with following options:

--detect_adapter_for_pe --qualified_quality_phred 20 --length_required 100 --dedup --dup_calc_accuracy 6 --cut_tail
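
In full, the fastp call was along these lines (the file names here are placeholders, not the exact paths I used):

    # paired-end quality/length filtering, deduplication, and adapter trimming
    fastp -i reads_1.fastq.gz -I reads_2.fastq.gz \
        -o filtered_1.fastq.gz -O filtered_2.fastq.gz \
        --detect_adapter_for_pe --qualified_quality_phred 20 --length_required 100 \
        --dedup --dup_calc_accuracy 6 --cut_tail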

kraken2 for decontamination

kraken2 

    --db pluspf
    --confidence 0.1 
    --paired
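
Putting the full call together (output names are placeholders, with filtered_1/filtered_2 being the fastp output from above; kraken2 replaces the # in --unclassified-out with the mate number):

    kraken2 --db pluspf --confidence 0.1 --paired --gzip-compressed \
        --unclassified-out SRR847379.unclassified#.fastq \
        --report SRR847379.kraken2.report \
        filtered_1.fastq.gz filtered_2.fastq.gz > SRR847379.kraken2.out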

The unclassified reads, both forward and reverse, were then combined into a single file and used as input for RESPECT. Note that I ran this sample twice, once with 800 Mb worth of reads and once with 1600 Mb (double).
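
Roughly, that step looked like this (placeholder names again; seqtk is shown just as one way to subsample, and the read count of 8,000,000 is simply 800 Mb divided by the ~100 bp read length):

    # combine unclassified forward and reverse reads into a single file
    cat SRR847379.unclassified_1.fastq SRR847379.unclassified_2.fastq > SRR847379.unclassified.fastq

    # subsample ~8 million reads (~800 Mb at ~100 bp/read) before running RESPECT
    seqtk sample -s100 SRR379847.unclassified.fastq 8000000 | gzip > SRR847379.unclassified.downsampled.1x.fastq.gz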

Do you have an idea of what can explain this difference? Many thanks in advance!

Kind regards, Sam

edit: it is maybe worth mentioning that kraken2 only classified and removed 3.7 percent of the reads, but that is because I had already filtered out all reads shorter than 100 bp (about half of them), since 100 bp seems to be the average read length. Both pipelines seem to be about equally sensitive in contaminant detection. I will rerun the analysis later this week with less stringent length filtering, and with lower coverage as well.

Banthandor commented 7 months ago

I reran the same sample but only filtered out reads shorter than 50 bp. The results are now closer to the value you reported, but the estimate is still very sensitive to coverage, as you can see below. The estimates at lower coverage seem more realistic.

sample  input_type  sequence_type   coverage    genome_length   uniqueness_ratio    HCRM    sequencing_error_rate   average_read_length
SRR847379.unclassified.downsampled.1x.5000000reads.fastq.gz sequence    genome-skim 0.95    491992495   0.44    444.28  0.0048  93.9043
SRR847379.unclassified.downsampled.1x.6000000reads.fastq.gz sequence    genome-skim 1.74    321318985   0.67    455.01  0.0053  93.538
SRR847379.unclassified.downsampled.1x.4000000reads.fastq.gz sequence    genome-skim 0.78    481384762   0.44    451.15  0.0054  94.3476

preprocessing logs:

Read1 before filtering:
total reads: 57833139
total bases: 5841147039
Q20 bases: 5149339115(88.1563%)
Q30 bases: 4680648929(80.1324%)

Read2 before filtering:
total reads: 57833139
total bases: 5841147039
Q20 bases: 5353511683(91.6517%)
Q30 bases: 4891708669(83.7457%)

Read1 after filtering:
total reads: 53684554
total bases: 5032951723
Q20 bases: 4884699021(97.0544%)
Q30 bases: 4456230904(88.5411%)

Read2 after filtering:
total reads: 53684554
total bases: 5077175421
Q20 bases: 4963648582(97.764%)
Q30 bases: 4557936837(89.7731%)

Filtering result:
reads passed filter: 107876358
reads failed due to low quality: 2790314
reads failed due to too many N: 302934
reads failed due to too short: 4696672
reads with adapter trimmed: 3323773
bases trimmed due to adapters: 127278526

Duplication rate: 0.865901%

Insert size peak (evaluated by paired-end reads): 1

53684554 sequences (10110.13 Mbp) processed in 70.335s (45796.2 Kseq/m, 8624.55 Mbp/m).
  2424513 sequences classified (4.52%)
  51260041 sequences unclassified (95.48%)
shahab-sarmashghi commented 7 months ago

Hi Sam,

There are a few factors we can look into to find the source of the discrepancy. The first thing I noticed is that you haven't merged the forward and reverse reads, which might cause some issues depending on the insert size and how much overlap there is between the read pairs. Since it seems you have high enough coverage that you are downsampling anyway, can you try running on the read pairs separately and see if you get more consistent results? Also, I don't think we did any read length filtering in our preprocessing, so unless you have a specific reason to be concerned about that in your data, I wouldn't put too much weight on it. Please let me know how it goes.
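
If you do want to try merging, a tool like BBMerge can handle overlapping pairs; a rough sketch with placeholder file names:

    # merge overlapping read pairs; pairs that do not overlap are written to outu1/outu2
    bbmerge.sh in1=reads_1.fastq.gz in2=reads_2.fastq.gz \
        out=merged.fastq.gz outu1=unmerged_1.fastq.gz outu2=unmerged_2.fastq.gz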

Best, Shahab