rajewsky-lab / spacemake


error in create_spatial_barcode_file: trying to merge on int64 and object columns #115

Closed gesavoigt closed 5 months ago

gesavoigt commented 6 months ago

I am trying to process Seq-Scope data and am getting an error in the create_spatial_barcode_file rule:

rule create_spatial_barcode_file:
    input: /gpfs/bwfor/work/ws/hd_fz305-seqscope/data/cho2021/results_spacemake/puck_barcode_file.txt, projects/cho2021_liver/processed_data/SRR14082756/illumina/complete_data/out_readcounts_prealigned.txt.gz
    output: projects/cho2021_liver/processed_data/SRR14082756/illumina/complete_data/puck_barcode_files/spatial_barcodes_puck_barcode_file.csv
    jobid: 2
    wildcards: project_id=cho2021_liver, sample_id=SRR14082756, puck_barcode_file_id=puck_barcode_file

Job counts:
        count   jobs
        1       create_spatial_barcode_file
        1
[Wed May 15 16:25:24 2024]
Error in rule create_spatial_barcode_file:
    jobid: 0
    output: projects/cho2021_liver/processed_data/SRR14082756/illumina/complete_data/puck_barcode_files/spatial_barcodes_puck_barcode_file.csv

RuleException:
ValueError in line 429 of /home/hd/hd_hd/hd_fz305/miniconda3/envs/spacemake/lib/python3.10/site-packages/spacemake/snakemake/main.smk:
You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat
  File "/home/hd/hd_hd/hd_fz305/miniconda3/envs/spacemake/lib/python3.10/site-packages/snakemake/executors/__init__.py", line 2330, in run_wrapper
  File "/home/hd/hd_hd/hd_fz305/miniconda3/envs/spacemake/lib/python3.10/site-packages/spacemake/snakemake/main.smk", line 429, in __rule_create_spatial_barcode_file
  File "/home/hd/hd_hd/hd_fz305/miniconda3/envs/spacemake/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 107, in merge
  File "/home/hd/hd_hd/hd_fz305/miniconda3/envs/spacemake/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 704, in __init__
  File "/home/hd/hd_hd/hd_fz305/miniconda3/envs/spacemake/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 1257, in _maybe_coerce_merge_keys
  File "/home/hd/hd_hd/hd_fz305/miniconda3/envs/spacemake/lib/python3.10/site-packages/snakemake/executors/__init__.py", line 569, in _callback
  File "/home/hd/hd_hd/hd_fz305/miniconda3/envs/spacemake/lib/python3.10/concurrent/futures/thread.py", line 58, in run
  File "/home/hd/hd_hd/hd_fz305/miniconda3/envs/spacemake/lib/python3.10/site-packages/snakemake/executors/__init__.py", line 555, in cached_or_run
  File "/home/hd/hd_hd/hd_fz305/miniconda3/envs/spacemake/lib/python3.10/site-packages/snakemake/executors/__init__.py", line 2362, in run_wrapper
Exiting because a job execution failed. Look above for error message
output provided by 'mapping.smk' module (via 'get_mapped_BAM_output'): 'projects/cho2021_liver/processed_data/SRR14082756/illumina/complete_data/final.polyA_adapter_trimmed.bam'
output provided by 'mapping.smk' module (via 'get_star_unloaded_flag'): 'species_data/mm10/genome/star_index/genomeUnload.done'
Job failed, going on with independent jobs.
Exiting because a job execution failed. Look above for error message
Complete log: /gpfs/bwfor/work/ws/hd_fz305-seqscope/data/cho2021/results_spacemake/.snakemake/log/2024-05-15T162250.515379.snakemake.log
output provided by 'mapping.smk' module (via 'get_mapped_BAM_output'): 'projects/cho2021_liver/processed_data/SRR14082756/illumina/complete_data/final.polyA_adapter_trimmed.bam'
output provided by 'mapping.smk' module (via 'get_star_unloaded_flag'): 'species_data/mm10/genome/star_index/genomeUnload.done'
output provided by 'mapping.smk' module (via 'get_mapped_BAM_output'): 'projects/cho2021_liver/processed_data/SRR14082756/illumina/complete_data/final.polyA_adapter_trimmed.bam'
output provided by 'mapping.smk' module (via 'get_star_unloaded_flag'): 'species_data/mm10/genome/star_index/genomeUnload.done'
ERROR: SpacemakeError
an error occurred while snakemake() ran

My first guess was a pandas version issue, but downgrading didn't solve it. The spacemake environment.yaml pins pandas 1.5.1; I downgraded to pandas=1.4.0, but resuming the pipeline threw the same error. Pandas 1.3.0 cannot be installed because of dependency conflicts.

Any ideas on how to fix/circumvent this issue?

I am using spacemake v0.7.8. The data is from Cho et al., 2021, specifically using NCBI SRR14082756 as 2nd-seq data and DraI-100pM-mbcore-RD2.fastq.gz as 1st-seq data. Let me know in case you need any other information.
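
For readers who land here with the same ValueError: pandas raises it when the key columns of the two tables have incompatible dtypes, for example when one file's barcode column is parsed as integers and the other holds strings. The sketch below reproduces the message with two made-up frames and shows how to inspect the dtypes. The frames and their contents are hypothetical and are not taken from spacemake's main.smk; only the column name cell_bc is borrowed from the thread.

```python
import pandas as pd

# Hypothetical stand-ins for the two tables being merged.
# Left: the key column was parsed as integers (int64).
left = pd.DataFrame({"cell_bc": [11019698, 11018957], "x_pos": [1, 2]})
# Right: the key column holds barcode strings (object dtype).
right = pd.DataFrame(
    {"cell_bc": pd.Series(["NAGACGAC", "NTCAGCAA"], dtype=object), "reads": [10, 20]}
)

def try_merge(a, b, key="cell_bc"):
    """Return the merged frame, or the error message if pandas refuses."""
    try:
        return pd.merge(a, b, on=key)
    except ValueError as err:
        return str(err)

# Inspect the key dtypes first; a mismatch here is what triggers the error.
print(left["cell_bc"].dtype, right["cell_bc"].dtype)
result = try_merge(left, right)
print(result)

# Casting the key to str lets the merge run, but it is only a real fix if the
# values are actually barcodes. Here the numbers never match any barcode,
# so the merged table is empty.
aligned = pd.merge(left.astype({"cell_bc": str}), right, on="cell_bc")
print(len(aligned))
```

If the dtypes differ, the input file is usually the thing to look at (column order, separator, header), since a cast alone only hides the mismatch.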

nukappa commented 6 months ago

Hi @gesavoigt, have you already processed the 1st-seq data to extract the barcodes? What does your puck_barcode_file.txt look like?

gesavoigt commented 6 months ago

Hi @nukappa, thank you for the suggestion! It was indeed an issue with puck_barcode_file.txt. Its head now looks like this, and the error no longer occurs:

NAGACGACTCTCCCCGCTATAGATN,11019698,11011015
NTCAGCAAGAAGCCCCATCGAGATN,11018957,11011016
NTAATCAATACGCCGCGGTTAGATN,110112031,11011016
NACTCCCTCCACTCTACTCCAGATN,11019724,11011016

Unfortunately, I now get stuck later on. It seems as though the DGE is empty. The error message I got is this:

[Fri May 24 13:28:32 2024]
rule create_h5ad_dge:
    input: projects/cho2021_liver/processed_data/SRR14082756/illumina/complete_data/dge/dge.exon.polyA_adapter_trimmed.1000_beads_no_spatial_data.txt.gz, projects/cho2021_liver/processed_data/SRR14082756/illumina/complete_data/dge/dge.exon.polyA_adapter_trimmed.1000_beads_no_spatial_data.summary.txt, projects/cho2021_liver/processed_data/SRR14082756/illumina/complete_data/puck_barcode_files_summary.csv
    output: projects/cho2021_liver/processed_data/SRR14082756/illumina/complete_data/dge/dge.exon.polyA_adapter_trimmed.1000_beads_no_spatial_data.h5ad, projects/cho2021_liver/processed_data/SRR14082756/illumina/complete_data/dge/dge.exon.polyA_adapter_trimmed.1000_beads_no_spatial_data.obs.csv
    jobid: 18
    wildcards: project_id=cho2021_liver, sample_id=SRR14082756, data_root_type=complete_data, downsampling_percentage=, dge_type=.exon, dge_cleaned=, polyA_adapter_trimmed=.polyA_adapter_trimmed, mm_included=, n_beads=1000, puck_barcode_file_id=no_spatial_data, is_external=

Job counts:
        count   jobs
        1       create_h5ad_dge
        1
[Fri May 24 13:28:34 2024]
Error in rule create_h5ad_dge:
    jobid: 0
    output: projects/cho2021_liver/processed_data/SRR14082756/illumina/complete_data/dge/dge.exon.polyA_adapter_trimmed.1000_beads_puck_barcode_file.h5ad, projects/cho2021_liver/processed_data/SRR14082756/illumina/complete_data/dge/dge.exon.polyA_adapter_trimmed.1000_beads_puck_barcode_file.obs.csv

RuleException:
AttributeError in line 491 of /home/hd/hd_hd/hd_fz305/miniconda3/envs/spacemake/lib/python3.10/site-packages/spacemake/snakemake/main.smk:
'NoneType' object has no attribute 'shape'
  File "/home/hd/hd_hd/hd_fz305/miniconda3/envs/spacemake/lib/python3.10/site-packages/snakemake/executors/__init__.py", line 2330, in run_wrapper
  File "/home/hd/hd_hd/hd_fz305/miniconda3/envs/spacemake/lib/python3.10/site-packages/spacemake/snakemake/main.smk", line 491, in __rule_create_h5ad_dge
  File "/home/hd/hd_hd/hd_fz305/miniconda3/envs/spacemake/lib/python3.10/site-packages/spacemake/preprocess/dge.py", line 139, in dge_to_sparse_adata
  File "/home/hd/hd_hd/hd_fz305/miniconda3/envs/spacemake/lib/python3.10/site-packages/snakemake/executors/__init__.py", line 569, in _callback
  File "/home/hd/hd_hd/hd_fz305/miniconda3/envs/spacemake/lib/python3.10/concurrent/futures/thread.py", line 58, in run
  File "/home/hd/hd_hd/hd_fz305/miniconda3/envs/spacemake/lib/python3.10/site-packages/snakemake/executors/__init__.py", line 555, in cached_or_run
  File "/home/hd/hd_hd/hd_fz305/miniconda3/envs/spacemake/lib/python3.10/site-packages/snakemake/executors/__init__.py", line 2362, in run_wrapper
Exiting because a job execution failed. Look above for error message
output provided by 'mapping.smk' module (via 'get_mapped_BAM_output'): 'projects/cho2021_liver/processed_data/SRR14082756/illumina/complete_data/final.polyA_adapter_trimmed.bam'
output provided by 'mapping.smk' module (via 'get_star_unloaded_flag'): 'species_data/mm10/genome/star_index/genomeUnload.done'
need to add mt-missing because no mitochondrial stuff was among the genes for annotation
Job failed, going on with independent jobs.
Job counts:
        count   jobs
        1       create_h5ad_dge
        1
[Fri May 24 13:28:35 2024]
Error in rule create_h5ad_dge:
    jobid: 0
    output: projects/cho2021_liver/processed_data/SRR14082756/illumina/complete_data/dge/dge.exon.polyA_adapter_trimmed.1000_beads_no_spatial_data.h5ad, projects/cho2021_liver/processed_data/SRR14082756/illumina/complete_data/dge/dge.exon.polyA_adapter_trimmed.1000_beads_no_spatial_data.obs.csv

RuleException:
AttributeError in line 491 of /home/hd/hd_hd/hd_fz305/miniconda3/envs/spacemake/lib/python3.10/site-packages/spacemake/snakemake/main.smk:
'NoneType' object has no attribute 'shape'
  File "/home/hd/hd_hd/hd_fz305/miniconda3/envs/spacemake/lib/python3.10/site-packages/snakemake/executors/__init__.py", line 2330, in run_wrapper
  File "/home/hd/hd_hd/hd_fz305/miniconda3/envs/spacemake/lib/python3.10/site-packages/spacemake/snakemake/main.smk", line 491, in __rule_create_h5ad_dge
  File "/home/hd/hd_hd/hd_fz305/miniconda3/envs/spacemake/lib/python3.10/site-packages/spacemake/preprocess/dge.py", line 139, in dge_to_sparse_adata
  File "/home/hd/hd_hd/hd_fz305/miniconda3/envs/spacemake/lib/python3.10/site-packages/snakemake/executors/__init__.py", line 569, in _callback
  File "/home/hd/hd_hd/hd_fz305/miniconda3/envs/spacemake/lib/python3.10/concurrent/futures/thread.py", line 58, in run
  File "/home/hd/hd_hd/hd_fz305/miniconda3/envs/spacemake/lib/python3.10/site-packages/snakemake/executors/__init__.py", line 555, in cached_or_run
  File "/home/hd/hd_hd/hd_fz305/miniconda3/envs/spacemake/lib/python3.10/site-packages/snakemake/executors/__init__.py", line 2362, in run_wrapper
Exiting because a job execution failed. Look above for error message
output provided by 'mapping.smk' module (via 'get_mapped_BAM_output'): 'projects/cho2021_liver/processed_data/SRR14082756/illumina/complete_data/final.polyA_adapter_trimmed.bam'
output provided by 'mapping.smk' module (via 'get_star_unloaded_flag'): 'species_data/mm10/genome/star_index/genomeUnload.done'
need to add mt-missing because no mitochondrial stuff was among the genes for annotation
Job failed, going on with independent jobs.
Exiting because a job execution failed. Look above for error message
Complete log: /gpfs/bwfor/work/ws/hd_fz305-seqscope/data/cho2021/results_spacemake/.snakemake/log/2024-05-24T121941.357375.snakemake.log
output provided by 'mapping.smk' module (via 'get_mapped_BAM_output'): 'projects/cho2021_liver/processed_data/SRR14082756/illumina/complete_data/final.polyA_adapter_trimmed.bam'
output provided by 'mapping.smk' module (via 'get_star_unloaded_flag'): 'species_data/mm10/genome/star_index/genomeUnload.done'
output provided by 'mapping.smk' module (via 'get_mapped_BAM_output'): 'projects/cho2021_liver/processed_data/SRR14082756/illumina/complete_data/final.polyA_adapter_trimmed.bam'
output provided by 'mapping.smk' module (via 'get_star_unloaded_flag'): 'species_data/mm10/genome/star_index/genomeUnload.done'
output provided by 'mapping.smk' module (via 'get_mapped_BAM_output'): 'projects/cho2021_liver/processed_data/SRR14082756/illumina/complete_data/final.polyA_adapter_trimmed.bam'
output provided by 'mapping.smk' module (via 'get_star_unloaded_flag'): 'species_data/mm10/genome/star_index/genomeUnload.done'
ERROR: SpacemakeError

If I understand correctly, one of the inputs is dge.exon.polyA_adapter_trimmed.1000_beads_no_spatial_data.summary.txt, which contains only zeros:

## htsjdk.samtools.metrics.StringHeader
# DigitalExpression INPUT=projects/cho2021_liver/processed_data/SRR14082756/illumina/complete_data/final.polyA_adapter_trimmed.bam SUMMARY=projects/cho2021_liver/processed_data/SRR14082756/illumina/complete_data/dge/dge.exon.polyA_adapter_trimmed.1000_beads_no_spatial_data.summary.txt OUTPUT=projects/cho2021_liver/processed_data/SRR14082756/illumina/complete_data/dge/dge.exon.polyA_adapter_trimmed.1000_beads_no_spatial_data.txt.gz CELL_BARCODE_TAG=CB MOLECULAR_BARCODE_TAG=MI CELL_BC_FILE=projects/cho2021_liver/processed_data/SRR14082756/illumina/complete_data/topBarcodes.polyA_adapter_trimmed.1000_beads.txt TMP_DIR=[/tmp]    OUTPUT_READS_INSTEAD=false OMIT_MISSING_CELLS=false EDIT_DISTANCE=1 READ_MQ=10 MIN_BC_READ_THRESHOLD=0 USE_STRAND_INFO=true RARE_UMI_FILTER_THRESHOLD=0.0 GENE_NAME_TAG=gn GENE_STRAND_TAG=gs GENE_FUNCTION_TAG=gf STRAND_STRATEGY=SENSE LOCUS_FUNCTION_LIST=[CODING, UTR] VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
## htsjdk.samtools.metrics.StringHeader
# Started on: Fri May 24 13:24:15 CEST 2024

## METRICS CLASS        org.broadinstitute.dropseqrna.barnyard.DigitalExpression$DESummary
CELL_BARCODE    NUM_GENIC_READS NUM_TRANSCRIPTS NUM_GENES
AGGGTAGAAAGGGAGATAAG    0       0       0
CTCTCTCTCTCTCTCTCTCT    0       0       0
GGCTTAGTCTTCCGGCTGTG    0       0       0

Other related files, such as final.polyA_adapter_trimmed.bam -> genome.STAR.bam, topBarcodes.polyA_adapter_trimmed.1000_beads.txt, dge.exon.polyA_adapter_trimmed.1000_beads_no_spatial_data.txt.gz & dge.exon.polyA_adapter_trimmed.1000_beads_puck_barcode_file.txt.gz have data (let me know if you need their heads, as this is already a very long post). On a similar note, the spatial_barcodes_puck_barcode_file.csv file only contains a header, too:

cell_bc,x_pos,y_pos

Do you have any idea what might be causing this issue? I would appreciate any help in debugging.
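
A summary file like the one quoted above can be checked for all-zero counts without opening it by hand. The following is a minimal sketch, assuming the file is tab-separated with `#`-prefixed header lines as shown; the text is embedded inline with an abbreviated command line so the example runs on its own, and in practice one would pass the real path to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Abbreviated copy of the summary shown above (the long command line is shortened).
summary_text = (
    "## htsjdk.samtools.metrics.StringHeader\n"
    "# DigitalExpression INPUT=example.bam\n"
    "## htsjdk.samtools.metrics.StringHeader\n"
    "# Started on: Fri May 24 13:24:15 CEST 2024\n"
    "\n"
    "## METRICS CLASS\torg.broadinstitute.dropseqrna.barnyard.DigitalExpression$DESummary\n"
    "CELL_BARCODE\tNUM_GENIC_READS\tNUM_TRANSCRIPTS\tNUM_GENES\n"
    "AGGGTAGAAAGGGAGATAAG\t0\t0\t0\n"
    "CTCTCTCTCTCTCTCTCTCT\t0\t0\t0\n"
    "GGCTTAGTCTTCCGGCTGTG\t0\t0\t0\n"
)

# comment="#" makes pandas skip the metadata lines; the first remaining
# line is used as the column header.
summary = pd.read_csv(io.StringIO(summary_text), sep="\t", comment="#")

count_cols = ["NUM_GENIC_READS", "NUM_TRANSCRIPTS", "NUM_GENES"]
# True when no barcode has any genic read, transcript, or gene assigned.
all_zero = bool(summary[count_cols].to_numpy().sum() == 0)

print(f"{len(summary)} barcodes, all counts zero: {all_zero}")
```

An all-zero summary only says that no reads were assigned to genes for these barcodes; it does not say why.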

nukappa commented 5 months ago

Hi @gesavoigt, does your puck_barcode_file also contain the header cell_bc,x_pos,y_pos? Since your genome.star.bam is populated with mapped reads and the quantification for the top 1000 barcodes worked, it seems that something is wrong with your puck barcode file and the barcodes in your data don't match it.
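
One way to test whether the data matches the puck barcode file is to compare the two barcode sets directly. The sketch below is an illustration only: it uses a few barcodes quoted earlier in this thread, assumes the puck file is a CSV with a cell_bc column and that the observed barcodes come as one barcode per line, and the helper `barcode_report` is a hypothetical name, not part of spacemake.

```python
import io
import pandas as pd

# Miniature inputs built from barcodes quoted in this thread; in practice
# these would be file paths rather than in-memory strings.
puck_csv = io.StringIO(
    "cell_bc,x_pos,y_pos\n"
    "NAGACGACTCTCCCCGCTATAGATN,11019698,11011015\n"
    "NTCAGCAAGAAGCCCCATCGAGATN,11018957,11011016\n"
)
observed_txt = io.StringIO(
    "AGGGTAGAAAGGGAGATAAG\nCTCTCTCTCTCTCTCTCTCT\nGGCTTAGTCTTCCGGCTGTG\n"
)

# Read barcodes as strings so they are never parsed as numbers.
puck = pd.read_csv(puck_csv, dtype={"cell_bc": str})
observed = pd.read_csv(observed_txt, header=None, names=["cell_bc"], dtype=str)

def barcode_report(puck_bcs, observed_bcs):
    """Summarise barcode lengths and the number of barcodes present in both sets."""
    puck_set, obs_set = set(puck_bcs), set(observed_bcs)
    return {
        "puck_lengths": sorted({len(b) for b in puck_set}),
        "data_lengths": sorted({len(b) for b in obs_set}),
        "n_shared": len(puck_set & obs_set),
    }

report = barcode_report(puck["cell_bc"], observed["cell_bc"])
print(report)
```

Differing barcode lengths or zero shared barcodes would be consistent with an output file that contains only a header.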

gesavoigt commented 5 months ago

Hi @nukappa, I forgot to include the header in my post, but it is barcode,xcoord,ycoord. From what I found in the documentation, this should work as well, but please correct me if I am wrong.

I wondered whether it was an issue with the barcode trimming: I had previously included a constant region in the barcode (AGATN in the post above). I fixed that but still get the same error. Does any other processing, such as trimming, need to be done on the 2nd-seq file? As an example, the first 4 lines of reads 1 and 2, respectively, look like this:

NGAAGACACATGGCCTAATTTCTTGTGACTACAGCACCCTCGACTCTCGCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
+SRR14082756.1 1 length=151
#AAAF7JJFJFJJFFJ<JFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
NTTGTTGCCATATATTATAATAAATGCTGCACAGAAAATGTAAATAAACACTTAGTTAAAAATCCTCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+SRR14082756.1 1 length=151
#AAAFFFJJJ7JAJ<AJ<JJAJJJ<A7-<7F-F<FJJJ---JJJFFJJFJF7-JF--<AJFJ--777FJJJJJJJJJJJJJJJJJJJJJFFJJJFJJJJJJJJFAJF<A<F<FFFJJFJF<-777AAFJFJJJF--7AFFA--7A<F<F<-

Also, is it relevant that there seem to be no mitochondrial genes?

Many thanks in advance for your attention and suggestions!

nukappa commented 5 months ago

Hi @gesavoigt , did you solve this?

If not, could you share the star.Log.final here so we can see whether the reads actually map to the genome? Could it be that the read files R1 and R2 are swapped?
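
For anyone checking mapping rates themselves: STAR's final log is a list of `name | value` lines, so the key numbers can be pulled out with a few lines of Python. This is a rough sketch that assumes that layout; the helper name `parse_star_final_log` is hypothetical and the example values are invented for illustration, not taken from this dataset.

```python
def parse_star_final_log(text):
    """Collect 'name | value' pairs from the text of a STAR final log."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headings and blank lines carry no value
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats

# Invented example text in the assumed layout (not real results).
example_log = (
    "                          Number of input reads |\t1000000\n"
    "                        Uniquely mapped reads % |\t2.50%\n"
    "                 % of reads unmapped: too short |\t90.00%\n"
)

stats = parse_star_final_log(example_log)
uniquely_mapped = float(stats["Uniquely mapped reads %"].rstrip("%"))
print(f"Uniquely mapped: {uniquely_mapped}%")
```

A very low uniquely mapped percentage points to a problem upstream of the DGE step, such as swapped reads or an unsuitable reference.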

gesavoigt commented 5 months ago

Hi @nukappa, I just found the issue: it was actually the reference genome. Thanks for helping out even though it turned out not to be a problem with the spacemake pipeline. I just could not trace it back from the error message. Maybe this will at least help somebody else.