nanoporetech / duplex-tools

Splitting of sequence reads by internal adapter sequence search
Other
50 stars 6 forks source link

pairs.txt file empty, but pairs_from_bam/pair_ids.txt not empty #35

Open avilella opened 1 year ago

avilella commented 1 year ago

See below:

LS4:/data2/RUN/RUN091/20230203_1010_MN41147_ANY366_3647ace0/fastq_all$ duplex_tools pair *.sam > pairs.txt                                                                                      [10:53:04 - FindPairs] Duplex tools version: 0.3.0 
[10:53:04 - FindPairs] Creating seqsummary from bam                                                                                                                                                      24324it [00:01, 13652.28it/s]  
[10:53:05 - FindPairs] Calculating metrics.                                                                                                                                                              [10:53:05 - FindPairs] No alignment information found for validation.
[10:53:05 - FindPairs] Classifying pairs.                                                                                                                                                                [10:53:05 - FindPairs] 24242 pairs after filtering on duration between reads. (max 200000 s)
[10:53:05 - FindPairs] 4852 pairs after filtering on relative sequence length difference. (max 10.0% difference)                                                                                         [10:53:05 - FindPairs] 4851 pairs after absolute sequence length filtering. (max 5000 bp)
[10:53:05 - FindPairs] 4495 pairs after qscore filtering. (min qscore = 6)                                                                                                                               [10:53:05 - FindPairs] Found 4495 pairs within 24324 reads. (37.0% of reads are part of a pair).
[10:53:05 - FindPairs] Values above 100% are allowed since reads can be either template or complement                                                                                             
[10:53:05 - FindPairs] Writing files into pairs_from_bam directory                                                                                                                                       
[10:53:05 - FiltPairs] Duplex tools version: 0.3.0   
[10:53:05 - FiltPairs] Filtering Parameters:                                                        
        nbases_per_strand_to_align=250                                                              
        align_threshold=0.6
[10:53:05 - FiltPairs] Alignment Parameters:                                                        
        score_match=2                                                                               
        score_mismatch=-1                                                                                                                                                                                        penalty_open=4                                                                                                                                                                                   
        penalty_extend:1                                                                                                                                                                                 [10:53:07 - ReadFastq] Processed 0/1 input fastq/bam files.                                                                                                                                              
[10:53:07 - ReadFastq] Found 100.0% of required reads.
[10:53:07 - FiltPairs] Starting alignments.                                                         
[10:53:07 - AlignPairs] Aligning 4495 pairs                                                         
[10:53:07 - AlignPairs] Good pairs: 1799
[10:53:09 - Pair] Initial reads: 24324
[10:53:09 - Pair] Created pairs: 1799
[10:53:09 - Pair] Paired reads:  3598
[10:53:09 - Pair] Approximate duplex rate for fast5_all.sup.v4.0.0.wmoves.sam: 14.79%

The results are:

[          0 Feb  6 10:53]  ./pairs.txt
[       4096 Feb  6 10:53]  ./pairs_from_bam
[     332630 Feb  6 10:53]  ./pairs_from_bam/pair_ids.txt
[    1858079 Feb  6 10:53]  ./pairs_from_bam/pair_stats.txt
[     133126 Feb  6 10:53]  ./pairs_from_bam/pair_ids_filtered.txt
[     358563 Feb  6 10:53]  ./pairs_from_bam/pair_ids_scored.csv
[    2661432 Feb  6 10:53]  ./pairs_from_bam/read_segments.pkl

Could it be that the way the command was run, it outputs the pairs in the pairs_from_bam/pair_ids.txt file, rather than STDOUT?

Thanks

ollenordesjo commented 1 year ago

Hi @avilella, sorry for the delay in replying

Yes, that's correct, by default it will go to what the --output_dir is set to. The reason for this is to keep the associated files together (both the paired ids without the filtering, and also the filtered ids). Note that you would typically want to use the filtered pair ids for downstream analysis. If there are any reads missing, it's always possible to reconstruct expected pairs from either the pair_stats file (which has some additional metrics), or by the pair_ids.txt file (if you are ok with some false positives).