jbloomlab / dms_tools2

software for the analysis and visualization of deep mutational scanning data
GNU General Public License v3.0
29 stars 20 forks source link

Question about alignment #48

Closed tos2014 closed 4 years ago

tos2014 commented 4 years ago

Hi Jesse, Thank you for fixing the --bclen issue.

When I run dms2_batch_bcsubamp, the program seems to stop retaining reads after the first 110 reads or so for each sample.

When I manually check the text in the fastq files, I can see more than 110 reads. In the case for the TIMM23B amplicon, there are 11,633 reads.

Attached are the statistics for all samples and the fastq-R2 (in which reads begin with the 10-nucleotide barcode) for sample X1, with the resulting --bcinfo output for the same sample X1.

ALL_readstats.pdf X1_bcinfo.csv.gz UDP0001_R2_001.fastq.gz

jbloom commented 4 years ago

Hmm, this does sound like some sort of problem. Sorry that the program is turning out to be so buggy for your purpose.

In order for me to de-bug this, can you post the following:

I will then run it and see what happens on my end. If those files are too large to attach here or they involve project-specific details so you don't want to post them publicly, you can create a Google Drive link and send via e-mail.

tos2014 commented 4 years ago

Thanks! Here are the updated information zipped and attached.

This is the command I used after I cd into the directory. dms2_batch_bcsubamp --refseq TIMM23B.txt --alignspecs 2,432,1,11 --batchfile Batch.csv --summaryprefix ALL --bclen 0 --bclen2 10 --maxmuts 15 --minreads 1

For --alignspecs, the "2,432" is because FASTQ-R1 reads start from the second nucleotide in the --refseq file. The first (extra) base in the --refseq file was included because the program requires a multiple of 3 for the number of nucleotides in the --refseq file. The 11 is because alignment of the FASTQ-R2 reads with the --refseq file begins with the 11th base, after a 10-base barcode.

I set --maxmuts to 15 (an arbitrarily high number) because I am searching for a specific A->I(G) mutation in this amplicon and wanted to maximize the number of alignments.

TonySun.zip

jbloom commented 4 years ago

@tos2014: This is some strange bug with your gzipped FASTQ files. I'm not sure if the format is slightly corrupted? Basically the FASTQ reader in pysam is silently truncating them after the first 114 entries. I'm not able to debug the problem in pysam, although I have raised an issue there: https://github.com/pysam-developers/pysam/issues/919

However, it seems to work fine if you gunzip your FASTQ files and then just run things on the gunzipped files: *.fastq rather than *.fastq.gz.

tos2014 commented 4 years ago

@jbloom Thank you for diagnosing and resolving this issue! Another question, if you don't mind, why are some barcodes and the associated sequences with a R1/R2 count of 1 not able to align, as specified in the --bcinfo file? For example, using the prior uploaded data, in the --bcinfo file (attached), searching for barcode#13 AGGAGGACGG in the FASTQ file (UDP0001_R2_001.fastq) shows the following sequence: AGGAGGACGGGCACAACTATGCACATACACACTTCCAAATTCACAAACCATGCTTGTGCACAC... The sequence after the barcode matches the reverse complement of the end of the sequence in the --refseq file but the sequence is not printed in the Consensus sequence column. X1_bcinfo.csv.zip

Update: when I added --minfraccall 0.01 --R1trim 78 --minconcur 1.0, the barcode#13 AGGAGGACGG is now showing up with a consensus sequence in the --bcinfo file. The number "78" for --R1trim corresponds to the base number in R1 reads that represents that A->G mutation of interest. Afterward, I manually use the filter functions in Excel to print the base at position 78 and then calculate the mutation rate A/G.

I think I got it now! Thanks for your help!

jbloom commented 4 years ago

OK, just looking at this now. It sounds from the second comment that it looks like you have it squared away. If there is still an outstanding question, let me know.

tos2014 commented 4 years ago

@jbloom Yeah I have a way to consistently get the consensus sequences printed in the csv file for downstream analysis. Thanks very much for your help!