lhqing / cemba_data

Mapping pipeline for snmC-seq based technologies.
https://hq-1.gitbook.io/mc/
MIT License

fastq pattern path error in demultiplex #11

Closed emarti88 closed 3 years ago

emarti88 commented 3 years ago

Hello,

I am running yap demultiplex with the following command:

yap demultiplex --fastq_pattern "/path/to/fastq_files/snmC-seq2//fastq.gz" --output_dir /outdir/demux_output --config_path ./hg19_mapping_config.txt --cpu 4

However, I keep getting the following error:

Message: 'No fastq name remained, check if the name pattern is correct.'

Am I formatting the fastq_pattern correctly? I am certain that is where the fastq files are. Do you have an exact example of how to format the fastq_pattern?

Thank you. Eduardo
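A quick way to see what a pattern like this matches is to expand it with Python's glob module before running the pipeline. This is only a sketch: it assumes yap expands --fastq_pattern as a shell-style glob, which this thread does not confirm, and preview_pattern is a hypothetical helper name, not part of cemba_data.

```python
import glob


def preview_pattern(pattern):
    """List the files a shell-style glob pattern matches.

    Hypothetical helper for illustration; not part of cemba_data.
    """
    paths = sorted(glob.glob(pattern))
    print(f'{len(paths)} paths matched')
    for path in paths:
        print(path)
    return paths
```

If the count printed here is zero, the pattern itself is the problem. If files are listed, the pattern is fine and any remaining error has to come from a later step, such as file name parsing.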

emarti88 commented 3 years ago

Below is the entire message:

r1_right_cut = 10
r2_left_cut = 10
r2_right_cut = 10
quality_threshold = 20
length_threshold = 30
total_read_pairs_min = 1
total_read_pairs_max = 6000000
mapq_threshold = 10
num_upstr_bases = 0
num_downstr_bases = 2
compress_level = 5
unmapped_fastq = False
unmapped_param_str = ''
mode = 'mc'
barcode_version = 'V2'
r1_adapter = 'AGATCGGAAGAGCACACGTCTGAAC'
r2_adapter = 'AGATCGGAAGAGCGTCGTGTAGGGA'
bismark_reference = '/dcl01/FB2/data/personal/erafaelm/genomes/hg19/Bisulfite_Genome'
reference_fasta = '/dcl01/FB2/data/personal/erafaelm/genomes/hg19/genome.fa'
chrom_sizes_file = 'CHANGE_THIS_TO_YOUR_CHROM_SIZES_FILE'
mc_stat_feature = 'CHN CGN CCC'
mc_stat_alias = 'mCH mCG mCCC'

24 FASTQ file paths in input

Traceback (most recent call last):
  File "/users/erafaelm/.conda/envs/mapping/lib/python3.7/site-packages/cemba_data/demultiplex/fastq_dataframe.py", line 58, in _parse_v2_fastq_path
    assert primer_name[0] in 'ABCDEFGHIJKLMNOP'
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/users/erafaelm/.conda/envs/mapping/lib/python3.7/site-packages/cemba_data/demultiplex/fastq_dataframe.py", line 64, in _parse_v2_fastq_path
    raise ValueError
ValueError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/users/erafaelm/.conda/envs/mapping/bin/yap", line 8, in <module>
    sys.exit(main())
  File "/users/erafaelm/.conda/envs/mapping/lib/python3.7/site-packages/cemba_data/main.py", line 602, in main
    func(**args_vars)
  File "/users/erafaelm/.conda/envs/mapping/lib/python3.7/site-packages/cemba_data/demultiplex/demultiplex.py", line 455, in demultiplex_pipeline
    cpu=demultiplex_cpu)
  File "/users/erafaelm/.conda/envs/mapping/lib/python3.7/site-packages/cemba_data/demultiplex/demultiplex.py", line 51, in _demultiplex
    'fastq_dataframe.csv')
  File "/users/erafaelm/.conda/envs/mapping/lib/python3.7/site-packages/cemba_data/demultiplex/fastq_dataframe.py", line 113, in make_fastq_dataframe
    name_series = parser(path)
  File "/users/erafaelm/.conda/envs/mapping/lib/python3.7/site-packages/cemba_data/demultiplex/fastq_dataframe.py", line 66, in _parse_v2_fastq_path
    raise ValueError(f'Found unknown name pattern in path {path}')
ValueError: Found unknown name pattern in path /dcl01/FB2/data/core/sequencing/hiseq/HiSeq388_scMethyl/snmC-seq2/6274_3/SU383-C-snmC-seq2_S3_L006_R1_001.fastq.gz
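The three stacked tracebacks above are what Python prints when an exception is raised while another one is being handled. The sketch below reproduces that chain from the three lines visible in the traceback (the assert, the bare raise ValueError, and the final raise with a message). The nesting of the try blocks is an assumption, and check_primer_name is a hypothetical stand-in that takes the primer name as an argument, because the thread does not show how cemba_data extracts it from the path.

```python
VALID_FIRST_LETTERS = 'ABCDEFGHIJKLMNOP'  # taken from the assert in the traceback


def check_primer_name(primer_name, path):
    """Hypothetical stand-in for the check in _parse_v2_fastq_path."""
    try:
        try:
            # First traceback: the primer name must start with A-P.
            assert primer_name[0] in VALID_FIRST_LETTERS
        except AssertionError:
            # Second traceback: raised while handling the AssertionError.
            raise ValueError
    except ValueError:
        # Third traceback: the message that finally reaches the user.
        raise ValueError(f'Found unknown name pattern in path {path}')
    return primer_name
```

Only the last exception explains the failure in plain terms: the file name did not match the pattern the V2 parser expects. The first one shows which part of the name failed the check.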

lhqing commented 3 years ago

Hi Eduardo,

Can you provide more information about your input data?

Please note that YAP is a pipeline designed specifically for snmC-seq based data generated in the Ecker Lab, and many of its steps are defined around our own data generation process. I therefore don't expect this pipeline to be directly applicable to other single-cell methylome datasets. This is especially true of the demultiplexing step, for which I can't offer general support for data generated outside my lab.

That said, the demultiplexing step essentially relies on cutadapt functions. You can see the cutadapt documentation here: https://cutadapt.readthedocs.io/en/stable/guide.html#demultiplexing

If you can get single-cell FASTQ files from your raw data, you may be able to use YAP to prepare snakemake files based on this: https://hq-1.gitbook.io/mc/mapping-form-cell-level-fastq-files

Best, Hanqing

emarti88 commented 3 years ago

Hi Hanqing,

Thanks for your response. The libraries were prepared with the snmC-seq2 method as described by the Ecker lab. It was only a trial run, and we have only 96 cells from a single sample. We prepared the library with 12 different standard dual indexes, and the reads have already been demultiplexed by those indexes. That gives us 12 folders (hence the multiple folder levels in the fastq pattern, path//fastq), and the fastq files in each folder should contain 8 cells. Those fastq files still need to be demultiplexed according to your 6 bp inline sequences.

Would it work better to pool all the fastq files in a single folder? Or do you think there might be a problem with the config file? Please see my previous comment for the full output printed before the error.

Do you have any thoughts?

Thanks, Eduardo

lhqing commented 3 years ago

Hi Eduardo,

I understand you are using the snmC-seq2 protocol. To demultiplex your fastq files, you can use the cutadapt demultiplex function, following this part of the documentation: https://hq-1.gitbook.io/mc/#important-note. This is also what I used for my data. Once you have the single-cell FASTQ file pairs, you can map them by following this part of the documentation: https://hq-1.gitbook.io/mc/mapping-form-cell-level-fastq-files
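For readers who want a starting point, here is a rough sketch of how a paired-end cutadapt demultiplexing command could be assembled from Python. It is not YAP's own invocation. The helper name, the output file naming, and the assumption that the inline barcode sits at the 5' end of the first read passed in are all hypothetical. The barcode FASTA has to be supplied by the user (barcode sequences are not given in this thread). Check the flags against the cutadapt documentation linked above before use.

```python
from pathlib import Path


def build_cutadapt_demultiplex_command(r1_fastq, r2_fastq, barcode_fasta, out_dir):
    """Build (but do not run) a cutadapt demultiplexing command.

    Hypothetical helper; assumes the barcode is a 5' adapter on the
    first read given and that barcode_fasta holds one named entry
    per barcode.
    """
    out_dir = Path(out_dir)
    return [
        'cutadapt',
        '--no-indels',                  # barcodes should match without indels
        '-g', f'file:{barcode_fasta}',  # 5' adapters read from a FASTA file
        # cutadapt replaces {name} with the matching barcode's FASTA name
        '-o', str(out_dir / '{name}-R1.fastq.gz'),
        '-p', str(out_dir / '{name}-R2.fastq.gz'),
        str(r1_fastq),
        str(r2_fastq),
    ]
```

The returned list can be passed to subprocess.run(cmd, check=True) once cutadapt is installed. Building the command as a list rather than a single string avoids shell quoting problems with paths.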

As I noted in the documentation, this pipeline is customized for many ongoing projects in the lab, so I do not aim to provide support for general use cases because of my limited time. I hope you understand. But I am happy to discuss any problems you run into when analyzing your snmC-seq2 data.

Best, Hanqing