Error in rule cnt_muts: jobid: 17

AlessandroPilli commented 6 months ago

Dear Isaac,

This is the first time I'm working with bam2bakR and unfortunately I'm encountering some issues. I'm starting from BAM files that have been properly indexed. After the initial setup completed successfully, I modified the config.yml file with the relative paths (which should be right) for every file, annotation, and genome. The error message I get when running the pipeline is reported below:

Error in rule cnt_muts:
    jobid: 17
    input: results/sf_reads/FLT_2.s.bam, results/snps/snp.txt
    output: results/counts/FLT_2_counts.csv.gz, results/counts/FLT_2_check.txt
    log: logs/cnt_muts/FLT_2.log (check log file(s) for error details)
    conda-env: /storage/shared/alessandro/BMTimeLapse/bam2bakR/.snakemake/conda/02fdeccb1c42bceb5704cb87df1cdb25
    shell:

    chmod +x /home/alessandro/.cache/snakemake/snakemake/source-cache/runtime-cache/tmp74wpq9fy/https/raw.githubusercontent.com/simonlabcode/bam2bakR/main/workflow/scripts/mut_call.sh
    chmod +x /home/alessandro/.cache/snakemake/snakemake/source-cache/runtime-cache/tmp74wpq9fy/https/raw.githubusercontent.com/simonlabcode/bam2bakR/main/workflow/scripts/mut_call.py
    chmod +x /home/alessandro/.cache/snakemake/snakemake/source-cache/runtime-cache/tmp74wpq9fy/https/raw.githubusercontent.com/simonlabcode/bam2bakR/main/workflow/scripts/fragment_sam.awk
    /home/alessandro/.cache/snakemake/snakemake/source-cache/runtime-cache/tmp74wpq9fy/https/raw.githubusercontent.com/simonlabcode/bam2bakR/main/workflow/scripts/mut_call.sh 4 FLT_2 results/sf_reads/FLT_2.s.bam results/snps/snp.txt results/counts/FLT_2_counts.csv.gz results/counts/FLT_2_check.txt 40 TC PE R /home/alessandro/.cache/snakemake/snakemake/source-cache/runtime-cache/tmp74wpq9fy/https/raw.githubusercontent.com/simonlabcode/bam2bakR/main/workflow/scripts/mut_call.py /home/alessandro/.cache/snakemake/snakemake/source-cache/runtime-cache/tmp74wpq9fy/https/raw.githubusercontent.com/simonlabcode/bam2bakR/main/workflow/scripts/fragment_sam.awk False 1> logs/cnt_muts/FLT_2.log 2>&1

    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

As suggested, I checked the log file, and the error reported is:

The number of fragments for sample FLT_2 is 4
The fragment size is set to 8479184 alignments per fragment
Aligned .sam file fragmented for sample FLT_2
Start: 2024-05-22 16:40:52.411847
[E::idx_find_and_load] Could not retrieve index file for './results/counts/1_FLT_2_frag.bam'

The .bai file is in the same folder as the BAM file, so the "Could not retrieve index file" message seems strange to me. I guess the 1 before the file name in this error message is added by an internal process, since I don't have a file with that name. Also, I don't understand why it is starting from the FLT_2 file, since that is not the first one specified in the config.yml file.

Another question: if I have to re-run the process, do I have to delete any files or folders that the previous run generated and that might cause problems?

I'm providing the config.yml (as a txt) and the FLT_2.log to facilitate the understanding of the problem.

FLT_2.log config.txt

Is this the first time you have encountered something like this?

Best regards, Alessandro

isaacvock commented 6 months ago

Hello Alessandro,

The "Could not retrieve index" message is a known pysam bug that does not affect anything, and is not the actual source of the problem that is crashing the pipeline for you.

Rather, if you look at the bottom of the log file, you will see that it says "MD tag not present". As discussed in the Requirements for bam2bakR section of the documentation, it is crucial that your bam file have an MD tag, which is what is used to identify mutations and determine what the reference base was. Not all aligners provide MD tags by default, but all aligners can be asked to include the MD tag in their output. For example, if you are aligning your reads with STAR, you have to include the flag --outSAMattributes NH HI AS nM MD in your call to STAR. The first four are the default tags, and the 5th tells it to include the MD tag.
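One way to confirm that the tag is present before launching the pipeline is to scan a few alignment records. The following is a minimal sketch, not part of bam2bakR: it assumes the records arrive as SAM text lines (for instance from `samtools view`), and the helper names `has_md_tag` and `count_missing_md` are hypothetical.

```python
def has_md_tag(sam_line: str) -> bool:
    """Return True if a SAM alignment line carries an MD:Z: optional field."""
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 columns are mandatory; optional TAG:TYPE:VALUE fields follow.
    return any(field.startswith("MD:Z:") for field in fields[11:])


def count_missing_md(lines, limit=1000):
    """Check up to `limit` mapped records; return (checked, missing_md)."""
    checked = 0
    missing = 0
    for line in lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        if int(fields[1]) & 0x4:
            continue  # unmapped reads are not expected to carry MD
        checked += 1
        if not has_md_tag(line):
            missing += 1
        if checked >= limit:
            break
    return checked, missing
```

Feeding it `sys.stdin` while piping in the output of `samtools view` on the input BAM gives a quick count; any non-zero `missing` value suggests the aligner was run without the MD attribute.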

Regarding your second question: if you provide new bam files to bam2bakR, it will automatically rerun the full pipeline and thus overwrite all existing files, because a file that acts as input to the entire pipeline has been modified. In general, though, it is always good practice to delete files that you believe or know have problems that will affect later steps. In this case, for example, the bam files in the sf_reads directory will lack the necessary MD tag, since they were derived from a bam file lacking this tag. Thus, deleting them is a good idea even if not strictly necessary.
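To see which intermediate files predate a regenerated input, a simple modification-time comparison can help. This is only a rough sketch under stated assumptions: the function name `stale_outputs` is hypothetical, and comparing timestamps merely approximates the idea of "derived from an older input"; it does not reproduce Snakemake's actual rerun logic.

```python
from pathlib import Path


def stale_outputs(input_bam, derived_dir):
    """List files under derived_dir that were last modified before input_bam."""
    input_mtime = Path(input_bam).stat().st_mtime
    return sorted(
        path
        for path in Path(derived_dir).rglob("*")
        if path.is_file() and path.stat().st_mtime < input_mtime
    )
```

For example, calling `stale_outputs("BM_TimeLapse/bam/AF_1.bam", "results/sf_reads")` would list the sf_reads files that are older than the re-aligned bam file and are therefore candidates for deletion.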

Two final comments: 1) The order you list the files in the config is not necessarily the order in which Snakemake will run the relevant rules. The runtime order is chosen quasi-randomly by Snakemake. 2) The file name listed in the pysam message is a temporary file created by bam2bakR.

Best regards, Isaac

AlessandroPilli commented 6 months ago

Hi Isaac,

Thank you for answering so quickly. My fault for missing that detail; I will run STAR again with the new flag.

Alessandro

AlessandroPilli commented 6 months ago

Hi again Isaac,

As you suggested, I re-ran STAR to add the MD tag, but the pipeline is giving me the same problem even though the MD tags are present (I checked the BAM file). Do you know why this could be happening?

Also, as you said, I deleted the files in the sf_reads folder (and I can't restore them), but now the pipeline is missing those files and stops as soon as I run it, telling me:

MissingInputException in rule sort_filter in file https://raw.githubusercontent.com/simonlabcode/bam2bakR/main/workflow/rules/bam2bakr.smk, line 40:
Missing input files for rule sort_filter:
    output: results/sf_reads/AF_1.s.bam, results/sf_reads/AF_1_fixed_mate.bam, results/sf_reads/AF_1.f.sam
    wildcards: sample=AF_1
    affected files:
        BM_TimeLapse/bam/AF_1.bam

My BAM file is present.

Best regards, Alessandro

isaacvock commented 6 months ago

Hello Alessandro,

Did the sort_filter step rerun when you got the same error? It sounds like when you tried to run the pipeline without deleting the sf_reads folder it just used the bam files in that directory, leading to the same error you got last time. Thus, one way or the other, the sf_reads folder needed to get overwritten or deleted for you to stop encountering that error. I would have expected Snakemake to overwrite this directory when you provide it new input bam files, but sometimes Snakemake's rule rerunning logic confuses me.

The error you are showing suggests that there is some sort of typo in the path and/or file name for the input bam file. Please double check that the file really is located in the BM_TimeLapse/bam directory (relative to where you launched the workflow from) and that the file is named AF_1.bam. There is no other potential explanation for this error that comes to mind.

Best, Isaac

AlessandroPilli commented 6 months ago

I have double-checked every path and they are correct. Probably the best idea now is to remove every unnecessary folder and start again from the deploy-workflow step. Since I don't remember exactly which folders were created automatically at the beginning, could you please tell me which ones I should keep (such as the configuration folder), in case simply emptying them all is not enough?

Sincerely, Alessandro

isaacvock commented 6 months ago

The config/ and workflow/ folders and their contents are created by deploy-workflow. You can delete these as well as the logs/ and results/ directories created by bam2bakR.

I'm sorry to harp on this, but I am not confident that redeploying the workflow will address the error you are getting. The error is one I get all the time when I make a subtle error in the paths. Because you are supplying a relative path in your config, please make sure you are running snakemake ... from inside a directory that looks like:

config/
    |
    |--> config.yaml
BM_TimeLapse/
    |
    |--> bam/
        |
        |--> AF_1.bam
workflow/
    |
    |--> Snakefile

Another thing you can try is specifying the absolute path to the bam file rather than the relative one. In other words, navigate to the directory with the bam file in it, run pwd, and in the config specify the path as the output of pwd followed by the file name.
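The same check can be scripted. The sketch below is illustrative only and assumes, as described above, that a relative path in the config is interpreted relative to the directory the workflow is launched from; the function name `resolve_from_launch_dir` is hypothetical.

```python
from pathlib import Path


def resolve_from_launch_dir(config_path, launch_dir="."):
    """Return (absolute_path, is_existing_file) for a path taken from the config.

    Relative paths are joined to launch_dir; absolute paths are used as given.
    """
    path = Path(config_path)
    full = path if path.is_absolute() else Path(launch_dir) / path
    full = full.resolve()
    return full, full.is_file()
```

Running `resolve_from_launch_dir("BM_TimeLapse/bam/AF_1.bam")` from the directory where snakemake is invoked prints the absolute path that could be pasted into the config, along with whether a file actually exists there.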

In summary, the steps I would follow are:

1) Delete the results/ and logs/ directories from the old pipeline run.
2) Make sure your directory structure looks as described above.
3) If the error persists, try specifying absolute paths to the bam files.
4) If the error still persists, then try redeploying the workflow.

Also, please let me know that you tried 1-3, and please provide your config.yaml file, the full output that gets printed to the console when you run snakemake, and a screenshot of what you get when you run ls from inside the directory where you are running snakemake.

Best, Isaac

AlessandroPilli commented 6 months ago

Dear Isaac, thank you for the support. I finally managed to run the pipeline without errors. I don't know whether it was the rearrangement of the folders or the use of absolute paths that fixed it, but deleting the old incorrect results was necessary.

Thanks again

Best regards, Alessandro

simonlabcode / bam2bakR

Error in rule cnt_muts: jobid: 17 #11
