liumz93 / PEM-Q

a pipeline to process data of PEM-seq or data similar, which is more comprehensive than superQ
8 stars 6 forks source link

new issues with PEM-Q.py #7

Open aprilW0829 opened 1 month ago

aprilW0829 commented 1 month ago

I encountered the following error when I test example data. Is there any solution to this error? Thanks.

[PEM-Q] primerChrom: chr15 [PEM-Q] primer_start: 61986633 [PEM-Q] primer_end: 61986652 [PEM-Q] primer_strand: + [PEM-Q] primer: GGAAACCAGAGGGAATCCTC [PEM-Q] adapter: CCACGCGTGCTCTACA [PEM-Q] genome: /data2/wangxin/database/bwa/bwa_mm10/mm10 [PEM-Q] fastq_r1: CC055c_R1.fq.gz [PEM-Q] fastq_r2: CC055c_R2.fq.gz [PEM-Q] your adapter sequence: CCACGCGTGCTCTACA [PEM-Q] align to adapter... [FLASH] WARNING: An unexpectedly high proportion of combined pairs (53.72%) overlapped by more than 65 bp, the --max-overlap (-M) parameter. Consider increasing this parameter. (As-is, FLASH is penalizing overlaps longer than 65 bp when considering them for possible combining!) [PEM-Q] stitching reads using FLASh... [PEM-Q] bwa mem -t 8 adapter/adapter -k 10 -L 0 -T 10 CC055c_R2.fq.gz > bwa_align/CC055c_sti.adpt.sam 2>bwa_align/bwa_align_adapter.log [PEM-Q] sort and index bam... [PEM-Q] merging fastq files... mkdir: cannot create directory ‘bwa_align’: File exists [PEM-Q] index file used None//data2/wangxin/database/bwa/bwa_mm10/mm10//data2/wangxin/database/bwa/bwa_mm10/mm10 [PEM-Q] align to genome... [PEM-Q] bwa mem -Y -t 8 None//data2/wangxin/database/bwa/bwa_mm10/mm10//data2/wangxin/database/bwa/bwa_mm10/mm10 flash_out/CC055c.merge.fastq.gz > bwa_align/CC055c_sti.sam 2>bwa_align/bwa_alignstich.log [PEM-Q] filter no_primer reads... [PEM-Q] Your primer sequence: GGAAACCAGAGGGAATCCTC [PEM-Q] primer position: chr15 1 61986633 61986652 Traceback (most recent call last): File "/data2/wangxin/biosoft/PEM-Q/main/align_make_v5.1.py", line 536, in main() File "/data2/wangxin/biosoft/PEM-Q/main/align_make_v5.1.py", line 529, in main alignment.no_primer_filter() File "/data2/wangxin/biosoft/PEM-Q/main/align_make_v5.1.py", line 330, in no_primer_filter bam_file = pysam.AlignmentFile('bwa_align/'+self.bam_sort, 'rb') File "pysam/libcalignmentfile.pyx", line 742, in pysam.libcalignmentfile.AlignmentFile.cinit File "pysam/libcalignmentfile.pyx", line 991, in pysam.libcalignmentfile.AlignmentFile._open ValueError: file has no sequences defined (mode='rb') - is it SAM/BAM format? Consider opening with check_sq=False

liumz93 commented 1 month ago

Hello, thank you for your feedback. Based on the error message, it seems that the alignment file was not generated correctly. The alignment tool I am using is BWA, and the version I recently tested is 0.7.18-r1243-dirty. However, in general, the version of BWA should not affect the program's execution. My suggestion is that you first check the log files under data/bwa_align to see if there are more detailed error messages for further debugging.

aprilW0829 commented 1 month ago

Thank you for the quick reply, according to your recommended solution direction, I can already perform the PEM-Q.py to complete the fundamental analysis, but I seem to have encountered another problem as follows when testing the second step vector_analysis.py:

my code: vector_analyze.py CC055c pX330_SpCas9.fa bwa_mm10 chr1 + 7937 7956 error as follows: ‍[PEM-Q Vector Analysis] basename: CC055c [PEM-Q Vector Analysis] vector_fa: pX330_SpCas9.fa [PEM-Q Vector Analysis] genome: /data2/wangxin/database/bwa/bwa_mm10/bwa_mm10 [PEM-Q Vector Analysis] bait_chr: chr1 [PEM-Q Vector Analysis] bait_strand: + [PEM-Q Vector Analysis] sgRNA_start: 7937 [PEM-Q Vector Analysis] sgRNA_end: 7956 [PEM-Q Vector Analysis]seqtk subseq CC055c_R1.fq.gz indel/CC055c_discard.tab > CC055c_discard_R1.fq sh: seqtk: command not found [PEM-Q Vector Analysis]seqtk subseq CC055c_R2.fq.gz indel/CC055c_discard.tab > CC055c_discard_R2.fq sh: seqtk: command not found mkdir: cannot create directory ‘pX330_SpCas9’: File exists mkdir: cannot create directory ‘vector’: File exists [PEM-Q Vector Analysis] check file... [PEM-Q Vector Analysis] building vector index... [PEM-Q Vector Analysis] align pe_fq to vector... bwa mem -t 8 pX330_SpCas9/pX330_SpCas9 CC055c_discard_R1.fq CC055c_discard_R2.fq > pX330_SpCas9/CC055c_pe_vector.sam 2>pX330_SpCas9/bwa_align_pe_vector.log [PEM-Q Vector Analysis] sort and index bam... samtools view -S -b -h pX330_SpCas9/CC055c_pe_vector.sam > pX330_SpCas9/CC055c_pe_vector.bam && samtools sort pX330_SpCas9/CC055c_pe_vector.bam > pX330_SpCas9/CC055c_pe_vector.sort.bam && samtools index pX330_SpCas9/CC055c_pe_vector.sort.bam [PEM-Q Vector Analysis] align r2 to genome... bwa mem -t 8 -k 10 /home/mengzhu/database/bwa_indexes//data2/wangxin/database/bwa/bwa_mm10/bwa_mm10//data2/wangxin/database/bwa/bwa_mm10/bwa_mm10 CC055c_discard_R2.fq > pX330_SpCas9/CC055c_r2_genome.sam 2>pX330_SpCas9/bwa_align_pe_vector.log [PEM-Q Vector Analysis] sort and index bam... samtools view -S -b -h pX330_SpCas9/CC055c_r2_genome.sam > pX330_SpCas9/CC055c_r2_genome.bam && samtools sort pX330_SpCas9/CC055c_r2_genome.bam > pX330_SpCas9/CC055c_r2_genome.sort.bam && samtools index pX330_SpCas9/CC055c_r2_genome.sort.bam [PEM-Q Vector Analysis] processing primer filter... primer filter left: 0 [PEM-Q Vector Analysis] generating proper pair tab... paired: 0 r1: 0 r2: 0 [E::idx_find_and_load] Could not retrieve index file for 'pX330_SpCas9/CC055c_r1.paired.sort.bam' [E::idx_find_and_load] Could not retrieve index file for 'pX330_SpCas9/CC055c_r2.paired.sort.bam' Traceback (most recent call last): File "/data2/wangxin/biosoft/PEM-Q/vector_analyze.py", line 433, in main() File "/data2/wangxin/biosoft/PEM-Q/vector_analyze.py", line 428, in main proper_pair_tab(**kwargs) File "/data2/wangxin/biosoft/PEM-Q/vector_analyze.py", line 221, in proper_pair_tab r2_genome_bam = pysam.AlignmentFile(directory_store+"/"+basename+"_r2_genome.sort.bam",'rb') File "pysam/libcalignmentfile.pyx", line 742, in pysam.libcalignmentfile.AlignmentFile.cinit File "pysam/libcalignmentfile.pyx", line 991, in pysam.libcalignmentfile.AlignmentFile._open ValueError: file has no sequences defined (mode='rb') - is it SAM/BAM format? Consider opening with check_sq=False

I tried to check the pX330_SpCas9/bwa_align_pe_vector.log files following the instructions you made last referral , and it still shows [E::bwa_idx_load_from_disk] fail to locate the index files, but the successful results of the previous step can already show that the established genome mm10 index is not problematic, I don't know what the reason is here.

Looking forward to your reply, I would appreciate it . Best wishes! wangxin970829

@. | ---- Replied Message ---- | From | Mengzhu @.> | | Date | 10/10/2024 01:18 | | To | @.> | | Cc | @.>, @.***> | | Subject | Re: [liumz93/PEM-Q] new issues with PEM-Q.py (Issue #7) |

Hello, thank you for your feedback. Based on the error message, it seems that the alignment file was not generated correctly. The alignment tool I am using is BWA, and the version I recently tested is 0.7.18-r1243-dirty. However, in general, the version of BWA should not affect the program's execution. My suggestion is that you first check the log files under data/bwa_align to see if there are more detailed error messages for further debugging.

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>

liumz93 commented 1 month ago

Glad to hear that you successfully ran PEM-Q. Regarding the vector analysis, you first need to install a dependency package called seqtk (https://github.com/lh3/seqtk). Since vector insertion events are relatively rare, and the test FASTQ file I provided only includes a small subset of reads, I recommend downloading the complete sequencing file from this link: https://www.biosino.org/node/sample/detail/OES00075922 for a more comprehensive vector analysis.

Additionally, I rechecked the code and noticed a path error in line 127. The original path "/home/mengzhu/database/bwa_indexes/" is my old path, I've modified both tools/align_inser_va.py and vector_analyze.py. Please download the data and the latest code and re-run the analysis.

Thanks! L.

aprilW0829 commented 2 weeks ago

Thanks for ypur response! I will try again, But now I ran into another problem that I didn't seem to understand very well, I had been testing the test data to run successfully, but now I used to do the analysis of the self-test data, there are 9/10 samples that can't be analyzed suceessfully. In the analysis process of PEM-Q , the error is reported, I check the log file of the run, and it seems that it should not be a problem with the bwa comparison, it seems that error arised in step 6 of the run "######## 06 substitutions... ######", Where is the error problem here?

My code is as follows: PEM-Q.py bwa_hg38 PI-NT-Bio-OT3 43709929 chr20 43709995 43710014 + TGGGGTGTGGGGAGTGTGT The log process is as follows:

[PEM-Q] genome: bwa_hg38 [PEM-Q] sample: PI-NT-Bio-OT3 [PEM-Q] cutsite: 43709929 [PEM-Q] primer: TGGGGTGTGGGGAGTGTGT [PEM-Q] primer_chr: chr20 [PEM-Q] primer_start: 43709995 [PEM-Q] primer_end: 43710014 [PEM-Q] primer_strand: + ######## 01 Reads alignment... ######## align_make_v5.1.py bwa_hg38 PI-NT-Bio-OT3_R1.fq.gz PI-NT-Bio-OT3_R2.fq.gz -a CCACGCGTGCTCTACA -p TGGGGTGTGGGGAGTGTGT -r chr20 -s 43709995 -e 43710014 -d + [PEM-Q] primerChrom: chr20 [PEM-Q] primer_start: 43709995 [PEM-Q] primer_end: 43710014 [PEM-Q] primer_strand: + [PEM-Q] primer: TGGGGTGTGGGGAGTGTGT [PEM-Q] adapter: CCACGCGTGCTCTACA [PEM-Q] genome: bwa_hg38 [PEM-Q] fastq_r1: PI-NT-Bio-OT3_R1.fq.gz [PEM-Q] fastq_r2: PI-NT-Bio-OT3_R2.fq.gz [PEM-Q] your adapter sequence: CCACGCGTGCTCTACA [PEM-Q] align to adapter... [FLASH] WARNING: An unexpectedly high proportion of combined pairs (57.21%) overlapped by more than 65 bp, the --max-overlap (-M) parameter. Consider increasing this parameter. (As-is, FLASH is penalizing overlaps longer than 65 bp when considering them for possible combining!) [PEM-Q] stitching reads using FLASh... [PEM-Q] merging fastq files... [PEM-Q] bwa mem -t 8 -k 10 -L 0 -T 10 adapter/adapter PI-NT-Bio-OT3_R2.fq.gz > bwa_align/PI-NT-Bio-OT3_sti.adpt.sam 2>bwa_align/bwa_align_adapter.log [PEM-Q] sort and index bam... mkdir: cannot create directory ‘bwa_align’: File exists [PEM-Q] index file used /data2/wangxin/database/bwa/bwa_hg38/bwa_hg38 [PEM-Q] align to genome... [bam_sort_core] merging from 7 files... [PEM-Q] bwa mem -Y -t 8 /data2/wangxin/database/bwa/bwa_hg38/bwa_hg38 flash_out/PI-NT-Bio-OT3.merge.fastq.gz > bwa_align/PI-NT-Bio-OT3_sti.sam 2>bwa_align/bwa_alignstich.log [bam_sort_core] merging from 8 files... [PEM-Q] filter no_primer reads... [PEM-Q] Your primer sequence: TGGGGTGTGGGGAGTGTGT [PEM-Q] primer position: chr20 1 43709995 43710014 [PEM-Q] pass primer filter stitch reads: 0

align_make.py Done in 529.653s ######## 02 Barcode Extract... ######## rmb_dedup_v4.py PI-NT-Bio-OT3 17 CCACGCGTGCTCTACA [PEM-Q] basename: PI-NT-Bio-OT3 [PEM-Q] barcode_length: 17 [PEM-Q] adapter_seqeunce: CCACGCGTGCTCTACA [PEM-Q] align to check adapter... [PEM-Q] bwa mem -t 8 -k 5 -L 0 -T 14 adapter/adapter flash_out/PI-NT-Bio-OT3.merge.fastq.gz > barcode/PI-NT-Bio-OT3_check.adpt.sam 2>barcode/bwa_align_adapter.log [PEM-Q] sort and index bam... [bam_sort_core] merging from 8 files...

rmb_dedup.py Done in 334.138s ######## 03 Define transloc... ######## define_transloc_v5.1_mpf.py PI-NT-Bio-OT3 43709929 [PEM-Q] basename: PI-NT-Bio-OT3 [PEM-Q] cutsite: 43709929 [E::idx_find_and_load] Could not retrieve index file for 'barcode/PI-NT-Bio-OT3_sti.filter.sort.bam' multiple junctions remove 0

define_transloc.py Done in 0.02s ######## 04 Define indels... ######## define_indel_v5.1_mpf.py PI-NT-Bio-OT3 43709929 + [PEM-Q] basename: PI-NT-Bio-OT3 [PEM-Q] cutsite: 43709929 [PEM-Q] primer_strand: + [E::idx_find_and_load] Could not retrieve index file for 'indel/PI-NT-Bio-OT3_indel.sort.bam' indel/PI-NT-Bio-OT3_indel_mut.tab 0 Bait_end Prey_start 43709929 transloc/PI-NT-Bio-OT3_mut.tab 0 raw/PI-NT-Bio-OT3_SID_all.tab 0

define_indel.py Done in 0.026s ######## 05 DEDUP... ######## dedup_v5.1_mpf.py PI-NT-Bio-OT3 [PEM-Q] basename: PI-NT-Bio-OT3 repeats_dedup.py PI-NT-Bio-OT3_SID_all.tab 30 -f 'Rname,Strand,Bait_end,Junction' file: PI-NT-Bio-OT3_SID_all.tab mapqual_cutoff: 30 feature_list: ['Rname', 'Strand', 'Bait_end', 'Junction']

repeats_dedup.py Done in 0.008s repeats_dedup.py PI-NT-Bio-OT3_SV.tab 30 -f 'Rname,Strand,Bait_end,Junction' file: PI-NT-Bio-OT3_SV.tab mapqual_cutoff: 30 feature_list: ['Rname', 'Strand', 'Bait_end', 'Junction']

repeats_dedup.py Done in 0.008s repeats_dedup.py PI-NT-Bio-OT3_Insertion.tab 30 -f 'Rname,Strand,Bait_end,Insertion,Junction' file: PI-NT-Bio-OT3_Insertion.tab mapqual_cutoff: 30 feature_list: ['Rname', 'Strand', 'Bait_end', 'Insertion', 'Junction']

repeats_dedup.py Done in 0.007s repeats_dedup.py PI-NT-Bio-OT3_Insertion_sv.tab 30 -f 'Rname,Strand,Bait_end,Insertion,Junction' file: PI-NT-Bio-OT3_Insertion_sv.tab mapqual_cutoff: 30 feature_list: ['Rname', 'Strand', 'Bait_end', 'Insertion', 'Junction']

repeats_dedup.py Done in 0.007s repeats_dedup.py PI-NT-Bio-OT3_Deletion.tab 30 -f 'Rname,Strand,Bait_end,Junction' file: PI-NT-Bio-OT3_Deletion.tab mapqual_cutoff: 30 feature_list: ['Rname', 'Strand', 'Bait_end', 'Junction']

repeats_dedup.py Done in 0.007s repeats_dedup.py PI-NT-Bio-OT3_Deletion.tab 30 -f 'Rname,Strand,Bait_end,Junction' file: PI-NT-Bio-OT3_Deletion.tab mapqual_cutoff: 30 feature_list: ['Rname', 'Strand', 'Bait_end', 'Junction']

repeats_dedup.py Done in 0.007s repeats_dedup.py PI-NT-Bio-OT3_smallindel_cutoff.tab 30 -f 'Rname,Strand,Bait_end,Junction' file: PI-NT-Bio-OT3_smallindel_cutoff.tab mapqual_cutoff: 30 feature_list: ['Rname', 'Strand', 'Bait_end', 'Junction']

repeats_dedup.py Done in 0.018s repeats_dedup.py PI-NT-Bio-OT3_transloc.tab 30 -f 'Rname,Strand,Bait_end,Junction' file: PI-NT-Bio-OT3_transloc.tab mapqual_cutoff: 30 feature_list: ['Rname', 'Strand', 'Bait_end', 'Junction']

repeats_dedup.py Done in 0.007s repeats_dedup.py PI-NT-Bio-OT3_smallindel.tab 30 -f 'Rname,Strand,Bait_end,Junction' file: PI-NT-Bio-OT3_smallindel.tab mapqual_cutoff: 30 feature_list: ['Rname', 'Strand', 'Bait_end', 'Junction']

repeats_dedup.py Done in 0.007s

dedup.py Done in 4.162s ######## 06 substitutions... ######## define_substitution.py PI-NT-Bio-OT3 43709929 10 + [PEM-Q] basename: PI-NT-Bio-OT3 [PEM-Q] cutsite: 43709929 [PEM-Q] cutoff: 10 [PEM-Q] primer_strand: + Substitutions in cutsite +- cutoff: 0 0 Processing statistics... Traceback (most recent call last): File "/data2/wangxin/biosoft/PEM-Q/main/define_substitution.py", line 181, in main() File "/data2/wangxin/biosoft/PEM-Q/main/define_substitution.py", line 176, in main define_substitution(kwargs) File "/data2/wangxin/biosoft/PEM-Q/main/define_substitution.py", line 126, in define_substitution chrom = data["Bait_rname"][0] File "/home/wangxin/miniconda3_1/envs/PEM-Q/lib/python3.8/site-packages/pandas/core/series.py", line 879, in getitem return self._values[key] IndexError: index 0 is out of bounds for axis 0 with size 0 ######## 07 filter and Statistics... ######## define_statistics_add_filter.py PI-NT-Bio-OT3 bwa_hg38 43709929 500000 TGGGGTGTGGGGAGTGTGT chr20 + CCACGCGTGCTCTACA [PEM-Q] basename: PI-NT-Bio-OT3 [PEM-Q] genome: bwa_hg38 [PEM-Q] cutsite: 43709929 [PEM-Q] transloc_range: 500000 [PEM-Q] primer: TGGGGTGTGGGGAGTGTGT [PEM-Q] primer_chr: chr20 [PEM-Q] primer_strand: + [PEM-Q] adapter: CCACGCGTGCTCTACA Processing junction filter and statistics... Traceback (most recent call last): File "/data2/wangxin/biosoft/PEM-Q/main/define_statistics_add_filter.py", line 276, in main() File "/data2/wangxin/biosoft/PEM-Q/main/define_statistics_add_filter.py", line 271, in main statistics_add_filter(kwargs) File "/data2/wangxin/biosoft/PEM-Q/main/define_statistics_add_filter.py", line 62, in statistics_add_filter germ = pd.read_csv(germ_file, sep = '\t', index_col=False, low_memory=False) File "/home/wangxin/miniconda3_1/envs/PEM-Q/lib/python3.8/site-packages/pandas/io/parsers.py", line 688, in read_csv return _read(filepath_or_buffer, kwds) File "/home/wangxin/miniconda3_1/envs/PEM-Q/lib/python3.8/site-packages/pandas/io/parsers.py", line 454, in _read parser = TextFileReader(fp_or_buf, kwds) File "/home/wangxin/miniconda3_1/envs/PEM-Q/lib/python3.8/site-packages/pandas/io/parsers.py", line 948, in init self._make_engine(self.engine) File "/home/wangxin/miniconda3_1/envs/PEM-Q/lib/python3.8/site-packages/pandas/io/parsers.py", line 1180, in _make_engine self._engine = CParserWrapper(self.f, self.options) File "/home/wangxin/miniconda3_1/envs/PEM-Q/lib/python3.8/site-packages/pandas/io/parsers.py", line 2010, in init self._reader = parsers.TextReader(src, **kwds) File "pandas/_libs/parsers.pyx", line 382, in pandas._libs.parsers.TextReader.cinit File "pandas/_libs/parsers.pyx", line 674, in pandas._libs.parsers.TextReader._setup_parser_source FileNotFoundError: [Errno 2] No such file or directory: 'unique/PI-NT-Bio-OT3_Germline_final.tab' All done, please check log files! PEM-Q Done in 871.16s

Mingdi-W commented 1 week ago

@aprilW0829 I'v seen this problem. There are some parameters, u should change and check. "[PEM-Q] pass primer filter stitch reads: 0", this step limits ur barcode length is 4。 After u correct this step ,u can get lots of reads。

aprilW0829 commented 1 week ago

Hello, ‍Thanks for your reply! ‍You said that you can choose the parameter u to adjust, I have two more questions. The first is that the barcode length you defined is 4, where the barcode refers to which part of the sequence in the sequencing, (The sequence library should consist of the following parts (as shown in the figure): R1: primer, the edited intermediate sequence, and R2: the edited intermediate sequence, adaper, and UMI [unique molecular identifier]). And the second question is,which python program is this u an optional parameter? Because I ran this program PEM--Q.py and I didn't see the u parameter.

Looking forward to your reply, I would appreciate it . Best wishes! wangxin970829

@. | ---- Replied Message ---- | From | @.> | | Date | 11/6/2024 16:35 | | To | @.> | | Cc | @.>, @.***> | | Subject | Re: [liumz93/PEM-Q] new issues with PEM-Q.py (Issue #7) |

@aprilW0829 I'v seen this problem. There are some parameters, u should change and check. "[PEM-Q] pass primer filter stitch reads: 0", this step limits ur barcode length is 4。 After u correct this step ,u can get lots of reads。

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: @.***>

Mingdi-W commented 1 week ago

@aprilW0829 In this pipeline, the author uses primer-seq to locate barcode, which is used for library construction and sequencing, not rmb. u can check some reads to confirm ur barcode length. In main/align_make_v5.1.py 354 and 355 line, u can change barcode number. "condition3 = read_check <= primer_check + 4#(barcode number) condition4 = read_check >= primer_check - 4"