veltenlab / CloneTracer

This repository contains scripts to identify healthy and malignant cells from scRNA-seq data with CloneTracer and to process data from optimized 10x libraries.

alignment_summary_stats, numpy error #2

Closed erkinacar5 closed 7 months ago

erkinacar5 commented 1 year ago

Hello again! I have been running the CloneTracer/nuclear-snv snakemake workflow and got an error at the alignment_summary_stats step. It looks like there is a mismatch between some headers. Have you encountered this error before? I am not sure whether it is caused by a wrong version of NumPy (which shouldn't be the case, since the conda environment is built from the envs/CloneTracer.yml file) or by something wrong with the input (which also sounds unlikely to me, since all the previous steps went smoothly). Any idea on how to solve this would be much appreciated!

I am attaching the stdout from the cluster and the error log. Please let me know if you need more information :) alignment_positions.log lsf.o233886170.txt

sergibeneyto commented 1 year ago

Hello,

The problem may be that NumPy was installed via pip inside the conda environment, which can apparently lead to conflicts. I have updated the conda environment so that all packages are installed via Anaconda.

Give it a try and let me know if the problem continues. Also note that the conda environment used to run the workflow has been renamed to envs/optimized10x.yml to be consistent with the nomenclature we used in the manuscript.

Thapeachydude commented 1 year ago

Hi @sergibeneyto,

we met at the SCG this year :). I'm working with @erkinacar5 on this. I've downloaded the current GitHub version and ran the snakemake workflow with the conda environment. The tool now runs further, but crashes at the make_count_table step.

As a side note: we're running this tool on 10x 5' data, without any PCR amplification of genes of interest. Could this be related? The other steps of the script seem to finish as intended. I was also wondering about two things:

I'm attaching the log files.

make_count_table.log

sergibeneyto commented 1 year ago

Hi @Thapeachydude,

nice to hear from you again :)

Did you align your reads to the hg38 genome from UCSC? I just found out that the make_count_table step wouldn't work due to an inconsistency in the contig names (chr plus number, in contrast to just the contig number in the Ensembl genome). I fixed the script. Could you try running it again with the newest version of make_count_table.py?

If you mapped to the Ensembl genome, could you make sure that the contig names of input.bam and input.variants are the same? There may be an incompatibility there.
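A quick way to compare the contig names is something like this (just a sketch with placeholder paths, assuming the variants are in VCF format):

```python
# Sketch: compare contig naming between the BAM header and a variants file.
# Paths are placeholders; adjust them to your own input.bam / input.variants.
import pysam

bam = pysam.AlignmentFile("input.bam", "rb")
vcf = pysam.VariantFile("input.variants.vcf")

bam_contigs = set(bam.references)
var_contigs = set(vcf.header.contigs)

# UCSC-style names carry a "chr" prefix (chr1); Ensembl-style names do not (1).
print("only in BAM:", sorted(bam_contigs - var_contigs)[:10])
print("only in variants:", sorted(var_contigs - bam_contigs)[:10])
```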

The script should work on 5' data too, as it only looks at the specified positions and extracts reference and mutant counts per cell.

Let me know if the problem persists.

sergibeneyto commented 1 year ago

With regard to your questions about 5' 10x data without PCR amplification: it would require a bit of re-writing to run the workflow with a Cell Ranger 10x .bam file, due to differences in some of the tags, for instance in the get_reads_cell step...

The adapter trimming and poly-A trimming are tailored to the 10x 3' kit, so they will not trim much in 5' libraries. I have not worked with 10x 5' data before, but if you find adapter starting sequences that should be trimmed, you can simply replace them in the config["trim_starting_sequence"]["adapter_sequence"] slot of the config.yml file of the Snakemake workflow.
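If it helps, this is roughly how you could inspect that slot programmatically (just a sketch, assuming the config.yml layout described above; editing the file directly works just as well):

```python
# Sketch: print the adapter sequence currently used for start trimming.
import yaml

with open("config.yml") as fh:
    config = yaml.safe_load(fh)

# Replace this value in config.yml with whatever adapter you find at the start of your 5' reads.
print(config["trim_starting_sequence"]["adapter_sequence"])
```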

I hope this helps.

Thapeachydude commented 1 year ago

Hi @sergibeneyto,

thanks for the reply! I tried running the updated version of make_count_table.py and that step now completes (I simply restarted the last snakemake submission). Now I'm stuck at rule table_to_rds. The script seems to be trying to install R packages again, even though I activated the optimized10x conda environment.

Installing package into ‘/cluster/home/mpohly/R/4.1’
(as ‘lib’ is unspecified)
Warning: unable to access index for repository https://ftp.fau.de/cran/src/contrib:
  cannot open URL 'https://ftp.fau.de/cran/src/contrib/PACKAGES'
Installing package into ‘/cluster/home/mpohly/R/4.1’
(as ‘lib’ is unspecified)
Warning: unable to access index for repository https://ftp.fau.de/cran/src/contrib:
  cannot open URL 'https://ftp.fau.de/cran/src/contrib/PACKAGES'
Warning messages:
1: replacing previous import ‘lifecycle::last_warnings’ by ‘rlang::last_warnings’ when loading ‘tibble’
2: replacing previous import ‘ellipsis::check_dots_unnamed’ by ‘rlang::check_dots_unnamed’ when loading ‘tibble’
3: replacing previous import ‘ellipsis::check_dots_used’ by ‘rlang::check_dots_used’ when loading ‘tibble’
4: replacing previous import ‘ellipsis::check_dots_empty’ by ‘rlang::check_dots_empty’ when loading ‘tibble’
5: replacing previous import ‘lifecycle::last_warnings’ by ‘rlang::last_warnings’ when loading ‘pillar’
6: replacing previous import ‘ellipsis::check_dots_unnamed’ by ‘rlang::check_dots_unnamed’ when loading ‘pillar’
7: replacing previous import ‘ellipsis::check_dots_used’ by ‘rlang::check_dots_used’ when loading ‘pillar’
8: replacing previous import ‘ellipsis::check_dots_empty’ by ‘rlang::check_dots_empty’ when loading ‘pillar’
9: package ‘tidyverse’ is not available for this version of R

A version of this package for your version of R might be available elsewhere,
see the ideas at
https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages
10: package ‘reticulate’ is not available for this version of R

A version of this package for your version of R might be available elsewhere,
see the ideas at
https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages
Warning message:
package ‘optparse’ was built under R version 4.1.3
Error in library(reticulate) : there is no package called ‘reticulate’
Execution halted

Regarding the alignment, I used the hg38 genome.fa file that came with Cell Ranger. Regarding the trimming, since 5' kits come with a shorter barcode, I adapted that. Would you perhaps have a visualization of what the PCR-amplified libraries look like and which part you're trimming, so I can understand whether I have to trim something for the regular 10x libraries as well?

Cheers, M

sergibeneyto commented 1 year ago

Hi @Thapeachydude,

sorry, I had overlooked that the table_to_rds step also runs an R script. It should work now: I removed the package installation section and changed the conda environment of the rule to the r_processing.yml file, which contains all the R packages needed. Let me know if the workflow runs successfully now. You would just have to clone the repository again and re-run it.

The adapter sequence that is trimmed in the workflow is the TSO sequence from the 10x 3' v3 kit, which is present in standard 10x 3' libraries. I think this sequence is also used during the generation of 5' libraries but is not present in the final library, so you could probably omit this step entirely for your 10x 5' data.

Thapeachydude commented 1 year ago

Hi @sergibeneyto,

sure, no worries. I deleted the old conda environment, pulled the GitHub repo and tried to recreate it, but that seems to fail.

(base) $ conda env create -f envs/optimized10x.yml
Collecting package metadata (repodata.json): done
Solving environment: failed

ResolvePackageNotFound:
  - libxml2==2.9.12=h72842e0_0

sergibeneyto commented 1 year ago

Hi,

thanks for noticing. It seems that version 2.9.12 of libxml2 got moved to the broken label of the conda-forge channel recently; that's why it is not found when installing the optimized10x conda environment (I had not modified the .yml file used for creating it). This seems to be an ongoing issue with conda environments at the moment.

I re-created the optimized10x environment and tested it on a toy dataset. You should be able to install it now. Let me know if it works.

We are currently thinking about containerizing the workflow to circumvent the use of conda, particularly if issues like this one happen more often...

Thapeachydude commented 1 year ago

Hi,

so I re-downloaded and installed everything and then ran the whole thing from scratch. We've made it to the last rule; unfortunately, it crashes there again, and I'm not sure why.

Error in rule summary_report:
    jobid: 28
    output: results/summary_reports/ULUHVV_2/ULUHVV_2_report.html
    log: results/logs/ULUHVV_2/summary_report.log (check log file(s) for error message)
    conda-env: /cluster/work/moor/loop_prj_analysis/CloneTracer/library_processing/nuclear-snv/.snakemake/conda/dac62a26
    cluster_jobid: Submitted batch job 3480427

Error executing rule summary_report on cluster (jobid: 28, external: Submitted batch job 3480427, jobscript: /cluster/work/moor/loop_prj_analysis/CloneTracer/library_processing/nuclear-snv/.snakemake/tmp.hliwmt5h/snakejob.summary_report.28.sh). For error details see the cluster log and the log files of the involved rule(s).

The results/logs/ULUHVV_2/summary_report.log file is unfortunately empty.

Output in the snakemake.err file:

CalledProcessError in line 671 of /cluster/work/moor/loop_prj_analysis/CloneTracer/library_processing/nuclear-snv/Snakefile:
Command 'source /cluster/work/moor/marcel_prj/conda_env/optimized10x/bin/activate '/cluster/work/moor/loop_prj_analysis/CloneTracer/library_processing/nuclear-snv/.snakemake/conda/dac62a26'; set -euo pipefail;  Rscript --vanilla -e 'rmarkdown::render("/cluster/work/moor/loop_prj_analysis/CloneTracer/library_processing/nuclear-snv/.snakemake/scripts/tmpu8cqgegq.summary_report.Rmd", output_file="/cluster/work/moor/loop_prj_analysis/CloneTracer/library_processing/nuclear-snv/results/summary_reports/ULUHVV_2/ULUHVV_2_report.html", quiet=TRUE, knit_root_dir = "/cluster/work/moor/loop_prj_analysis/CloneTracer/library_processing/nuclear-snv", params = list(rmd="/cluster/work/moor/loop_prj_analysis/CloneTracer/library_processing/nuclear-snv/.snakemake/scripts/tmpu8cqgegq.summary_report.Rmd"))'' returned non-zero exit status 1.
  File "/cluster/work/moor/marcel_prj/conda_env/optimized10x/lib/python3.10/site-packages/snakemake/executors/__init__.py", line 2317, in run_wrapper
  File "/cluster/work/moor/loop_prj_analysis/CloneTracer/library_processing/nuclear-snv/Snakefile", line 671, in __rule_summary_report
  File "/cluster/work/moor/marcel_prj/conda_env/optimized10x/lib/python3.10/site-packages/snakemake/executors/__init__.py", line 566, in _callback
  File "/cluster/work/moor/marcel_prj/conda_env/optimized10x/lib/python3.10/concurrent/futures/thread.py", line 58, in run
  File "/cluster/work/moor/marcel_prj/conda_env/optimized10x/lib/python3.10/site-packages/snakemake/executors/__init__.py", line 552, in cached_or_run
  File "/cluster/work/moor/marcel_prj/conda_env/optimized10x/lib/python3.10/site-packages/snakemake/executors/__init__.py", line 2348, in run_wrapper

Cheers, M

sergibeneyto commented 1 year ago

Hi,

could you send me the complete snakemake .out and .err files? Errors from rendering R Markdown documents in Snakemake are redirected to the .err file of the workflow.

Thapeachydude commented 1 year ago

Hi,

sure, you can find the files below. I had to change the extensions to be able to upload them.

snakemake_out.txt snakemake_err.txt

sergibeneyto commented 1 year ago

Hi,

could you load the count table .rds file data/{patient}/mutation_counts/count_table/{patient}_count_table.rds in R and check that not all cells have dropout status?

The summary report fails to generate the last plot, which uses the count_table filtered for covered cells only; if no cells are covered, that filtered table is empty and the plot cannot be drawn.
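If it is easier to stay in Python, something like this should also work (just a sketch: it assumes the table is stored as a plain data frame, and the status column name is a guess, so check the actual column names first):

```python
# Sketch: read the count table .rds in Python and tally the dropout status.
# pyreadr reads plain data-frame .rds files; "status" is a hypothetical column name.
import pyreadr

res = pyreadr.read_r("data/PATIENT/mutation_counts/count_table/PATIENT_count_table.rds")
df = res[None]                       # .rds objects come back under the key None
print(df.columns.tolist())           # look up the real name of the status column
print(df["status"].value_counts())   # hypothetical column name
```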

Thapeachydude commented 1 year ago

Hi,

I can check, but judging from some of the plots the summary generated, the cells were mainly dropouts. What constitutes a dropout? That the tested position is not sufficiently covered? And if so, is this per mutation or across all mutations?

Cheers, M

sergibeneyto commented 1 year ago

Hi,

I consider a cell a dropout for a particular mutation when there are no reads covering the specific position (neither reference nor mutant reads). In the final table there is one entry per cell and mutation, so each mutation is considered independently.

Do you know whether your mutations are very far away from the 5' end? If they are close to the 3' end, without targeted amplification of the sites the coverage will be very poor, or you may not get any cell covered. You could quickly check by loading the bam file data/{patient}/align_reads/gene_tagged_aligned.bam in a genome browser like IGV and checking whether there are any reads covering your mutations of interest.

If you do see reads in the bam file, a possible reason why all cells are dropouts for your mutations is that the parameter config["read_deduplication"]["min_reads_umi"] is set to 2 by default in the config.yml file. This means that for a UMI to be counted it needs to be supported by at least 2 reads, which helps reduce false positives arising from PCR amplification. If you did shallow sequencing of the default 10x libraries, most UMIs will only be supported by 1 read and will therefore be filtered out in the workflow. You could decrease the parameter to 1, although that would increase the risk of false positives.
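If you prefer not to open IGV, a small pysam sketch like the following would do the same check (contigs and positions are placeholders; the BAM needs an index for random access, e.g. via samtools index):

```python
# Sketch: count reads overlapping each mutation position in the tagged BAM.
# The (contig, position) pairs below are placeholders; use your own variants.
import pysam

bam = pysam.AlignmentFile("data/PATIENT/align_reads/gene_tagged_aligned.bam", "rb")

mutations = [("chr1", 123456), ("chr17", 7654321)]   # (contig, 1-based position), examples only
for contig, pos in mutations:
    n = bam.count(contig, pos - 1, pos)              # pysam uses 0-based, half-open coordinates
    print(f"{contig}:{pos}\t{n} overlapping reads")
```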

I hope this helps.

Thapeachydude commented 1 year ago

Hi, sorry it's been a while. I hope things have been well :)

Since we were running the tool with only 1-2 test variants, all cells were dropouts and the workflow therefore didn't finish. I've since tried re-running it with a more complete list of known mutations, but it seems to crash during the split_bam step.

[E::hts_open_format] Failed to open file data/ULUHVV_2/mutation_counts/count_table/temp_bams/split_bams/TAGTTGGAGCGTTCCG-1.bam
[E::hts_open_format] Failed to open file data/ULUHVV_2/mutation_counts/count_table/temp_bams/split_bams/CCCATACGTCCCTACT-1.bam
[E::hts_open_format] Failed to open file data/ULUHVV_2/mutation_counts/count_table/temp_bams/split_bams/ACGGAGATCCTCAATT-1.bam
[E::hts_open_format] Failed to open file data/ULUHVV_2/mutation_counts/count_table/temp_bams/split_bams/GATCGATGTAACGACG-1.bam
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/cluster/CloneTracer/library_processing/nuclear-snv/.snakemake/conda/ef674e3a/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/cluster/CloneTracer/library_processing/nuclear-snv/.snakemake/conda/ef674e3a/lib/python3.7/multiprocessing/pool.py", line 47, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "scripts/split_bam_single_files.py", line 40, in write_bam
    bam_cell_files = [pysam.AlignmentFile(file, "wb", template = bam) for file in outfiles]
  File "scripts/split_bam_single_files.py", line 40, in <listcomp>
    bam_cell_files = [pysam.AlignmentFile(file, "wb", template = bam) for file in outfiles]
  File "pysam/libcalignmentfile.pyx", line 741, in pysam.libcalignmentfile.AlignmentFile.__cinit__
  File "pysam/libcalignmentfile.pyx", line 914, in pysam.libcalignmentfile.AlignmentFile._open
OSError: [Errno 24] could not open alignment file `data/ULUHVV_2/mutation_counts/count_table/temp_bams/split_bams/TAGTTGGAGCGTTCCG-1.bam`: Too many open files
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "scripts/split_bam_single_files.py", line 172, in <module>
    [bam_file] * cores))
  File "/cluster/CloneTracer/library_processing/nuclear-snv/.snakemake/conda/ef674e3a/lib/python3.7/multiprocessing/pool.py", line 276, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/cluster/CloneTracer/library_processing/nuclear-snv/.snakemake/conda/ef674e3a/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
OSError: [Errno 24] could not open alignment file `data/ULUHVV_2/mutation_counts/count_table/temp_bams/split_bams/TAGTTGGAGCGTTCCG-1.bam`: Too many open files

Cheers, M

sergibeneyto commented 1 year ago

Hi,

there seems to be a problem with too many files being open at the same time. Some operating systems have a "low" limit, and the split_bam rule can result in many files being open simultaneously.

In config["read_deduplication"]["max_open_files"] you can set the maximum number of files opened simultaneously. The default is 10000, but this may exceed your system's limit. I would recommend checking the maximum number of open files your system allows and changing this parameter accordingly.
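To see what your system allows, a short check like this (or simply ulimit -n in the shell) prints the per-process limit:

```python
# Sketch: print the per-process open-file limits to help pick a safe max_open_files value.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")
# Set config["read_deduplication"]["max_open_files"] comfortably below the soft limit,
# leaving headroom for the files the workflow itself keeps open.
```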

Let me know if this fixes the issue.

Best, Sergi

Thapeachydude commented 1 year ago

Hi Sergi,

many thanks for the quick reply! I've had to play around with the number of cores and max number of files. Oddly enough, at some point it continued with n_cores = 1 and max_open_files = 2.

It then crashes at "combine_bams" with:

Traceback (most recent call last):
  File "scripts/merge_single_bams.py", line 146, in <module>
    shell_merge = "samtools merge -n" + " " + merged_bam + " " + " ".join(temp_bams)
NameError: name 'temp_bams' is not defined

However, I don't think the issue is at this step. When I look into the data/<patientID>/mutation_counts/count_table/temp_bams folder, I don't see the processed directory, which should be the input to that rule, and the split_bams directory exists but is empty.

Best, M

sergibeneyto commented 1 year ago

Hi, as it's currently written, the split_bam rule requires at least 2 cores to work correctly. Can you try increasing the number of cores to at least 2?

I am also surprised that your operating system only allows a maximum of 2 open files...

If this doesn't work for you, let me know and I will re-write the split_bam_single_files.py script so that it can be run with 1 core only.

Thapeachydude commented 1 year ago

Hi,

I seem to be unable to run this script with more than 1 core (in its current form). Either the number of open files is too large (see error above) or I get:

Traceback (most recent call last):
  File "scripts/split_bam_single_files.py", line 134, in <module>
    iterations = math.ceil(len(final_barcodes)/files_core)
ZeroDivisionError: division by zero

I'm having the same issue with the mitochondrial mutation workflow as well.

sergibeneyto commented 1 year ago

Hi,

sorry for the late reply. I made the split_bam scripts compatible with only 1 core and a maximum of 2 open files. Please try it and let me know if it works now.

It may be quite slow, particularly for the mitochondrial libraries, because it will have to iterate through the entire BAM file n = num_cells / max_open_files times.
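For example, with roughly 10,000 cells and max_open_files = 2, that would be about 5,000 passes over the BAM file.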

Best, Sergi

Thapeachydude commented 1 year ago

Hi,

so... it seems to run now until combine_bams.

Traceback (most recent call last):
  File "scripts/merge_single_bams.py", line 146, in <module>
    shell_merge = "samtools merge -n" + " " + merged_bam + " " + " ".join(temp_bams)
NameError: name 'temp_bams' is not defined

I'm not sure why it complains about the temp_bams. The temp_bams/processed directory contains the required bam files.

sergibeneyto commented 1 year ago

Hi, sorry,

I forgot to fix the merge_single_bams.py scripts for 1 core. They should be fixed now for both the nuclear and mitochondrial workflows.

Can you give it a try?

Thapeachydude commented 1 year ago

Hi, no worries.

The merge_single_bams part now runs for a bit, but then crashes with:

[bam_merge_core] fail to open file data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10317_temp.bam
[E::hts_open] fail to open file 'data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10318_temp.bam'
[bam_merge_core] fail to open file data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10318_temp.bam
[E::hts_open] fail to open file 'data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10319_temp.bam'
[bam_merge_core] fail to open file data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10319_temp.bam
[E::hts_open] fail to open file 'data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10320_temp.bam'
[bam_merge_core] fail to open file data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10320_temp.bam
[E::hts_open] fail to open file 'data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10321_temp.bam'
[bam_merge_core] fail to open file data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10321_temp.bam
[E::hts_open] fail to open file 'data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10322_temp.bam'
[bam_merge_core] fail to open file data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10322_temp.bam
[E::hts_open] fail to open file 'data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10323_temp.bam'
[bam_merge_core] fail to open file data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10323_temp.bam
[E::hts_open] fail to open file 'data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10324_temp.bam'
[bam_merge_core] fail to open file data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10324_temp.bam
[E::hts_open] fail to open file 'data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10325_temp.bam'
[bam_merge_core] fail to open file data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10325_temp.bam
[E::hts_open] fail to open file 'data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10326_temp.bam'
[bam_merge_core] fail to open file data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10326_temp.bam
[E::hts_open] fail to open file 'data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10327_temp.bam'
[bam_merge_core] fail to open file data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10327_temp.bam
[E::hts_open] fail to open file 'data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10328_temp.bam'
[bam_merge_core] fail to open file data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10328_temp.bam
[E::hts_open] fail to open file 'data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10329_temp.bam'
[bam_merge_core] fail to open file data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10329_temp.bam
[E::hts_open] fail to open file 'data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10330_temp.bam'
[bam_merge_core] fail to open file data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10330_temp.bam
[E::hts_open] fail to open file 'data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10331_temp.bam'
[bam_merge_core] fail to open file data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10331_temp.bam
[E::hts_open] fail to open file 'data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10332_temp.bam'
[bam_merge_core] fail to open file data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10332_temp.bam
[E::hts_open] fail to open file 'data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10333_temp.bam'
[bam_merge_core] fail to open file data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10333_temp.bam
[E::hts_open] fail to open file 'data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10334_temp.bam'
[bam_merge_core] fail to open file data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10334_temp.bam
[E::hts_open] fail to open file 'data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10335_temp.bam'
[bam_merge_core] fail to open file data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10335_temp.bam
[E::hts_open] fail to open file 'data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10336_temp.bam'
[bam_merge_core] fail to open file data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10336_temp.bam
[E::hts_open] fail to open file 'data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10337_temp.bam'
[bam_merge_core] fail to open file data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10337_temp.bam
[E::hts_open] fail to open file 'data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10338_temp.bam'
[bam_merge_core] fail to open file data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10338_temp.bam
[E::hts_open] fail to open file 'data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10339_temp.bam'
[bam_merge_core] fail to open file data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10339_temp.bam
[E::hts_open] fail to open file 'data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10340_temp.bam'
[bam_merge_core] fail to open file data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10340_temp.bam
mv: cannot stat ‘data/ULUHVV_2/mutation_counts/count_table/unmapped_tagged_10341_temp.bam’: No such file or directory

Will try running the mito part now and see if that produces a similar issue.

sergibeneyto commented 1 year ago

Hi,

could you send me the complete log file? You are using cores = 1 and max_files = 2, right? Can you also check that the BAM files stored in the data/ULUHVV_2/mutation_counts/count_table/temp_bams/processed directory are not empty? You could simply run samtools view on a bam file and check that reads are stored.
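If samtools is not at hand, the same check can be sketched in pysam (the path pattern is taken from the directory above; no index is needed):

```python
# Sketch: count reads in each processed per-cell BAM to make sure none are empty.
import glob
import pysam

pattern = "data/ULUHVV_2/mutation_counts/count_table/temp_bams/processed/*.bam"
for path in sorted(glob.glob(pattern)):
    with pysam.AlignmentFile(path, "rb", check_sq=False) as bam:
        n = sum(1 for _ in bam.fetch(until_eof=True))   # sequential read, works without an index
    print(f"{path}\t{n} reads")
```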

Thapeachydude commented 1 year ago

Yes, cores = 1, max_files = 2. You can find the log file attached.

The data/ULUHVV_2/mutation_counts/count_table/temp_bams/processed directory unfortunately seems to have been deleted (maybe due to the rule failure?). I could try commenting out the combine_bams steps in the Snakefile and then checking their size.

combine_single_bams.log

sergibeneyto commented 1 year ago

I think I found the problem in the script. Can you try the most recent version of merge_single_bams.py?

This step deletes all temporary bams by default. I would perhaps add the flag -k True to the combine_bams rule or simply -k, so that temporary files do not get deleted if the rule crashes.

Thapeachydude commented 1 year ago

So... with the latest iteration the combine_bams part seems to have worked (at least it didn't crash there).

But it did crash with make_count_table:

/cluster/work/moor/loop_prj_analysis/CloneTracer/library_processing/nuclear-snv/.snakemake/conda/ef674e3a/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3373: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/cluster/work/moor/loop_prj_analysis/CloneTracer/library_processing/nuclear-snv/.snakemake/conda/ef674e3a/lib/python3.7/site-packages/numpy/core/_methods.py:170: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
Traceback (most recent call last):
  File "scripts/make_count_table.py", line 246, in <module>
    status, coverage, miss_reads = process_indel(bam, contig, start, max_ratio, ref)
  File "scripts/make_count_table.py", line 139, in process_indel
    raw_reads = read.alignment.get_tag("cd")[read.query_position]
TypeError: array indices must be integers

sergibeneyto commented 1 year ago

Hi,

this may be due to the absence of reads covering a particular mutation.

I modified the make_count_table.py script to account for that. Can you give it a try with the newest version?

sergibeneyto commented 7 months ago

I will close this issue. We talked at SCG 2023 and you told me that the pipeline ran successfully but you did not have reads covering the mutations.