Error when running cisDynet_snakemake/atacqc/cal_tss.py

yashi99 commented 8 months ago

Hi, when running snakemake -s ./cisDynet_snakemake/atacqc/cisDynet_snakemake.py -j 15 ，I find this error :

Traceback (most recent call last):
  File "/home/cisDynet_snakemake/atacqc/cal_tss.py", line 17, in <module>
  dis_fliter_for=dis[dis['Strand']=='+']
  File "/home/systools/anaconda3/envs/cisDynet_pipeline/lib/python3.8/site-packages/pandas/core/frame.py", line 3761, in __getitem__
  indexer = self.columns.get_loc(key)
  File "/home/systools/anaconda3/envs/cisDynet_pipeline/lib/python3.8/site-packages/pandas/core/indexes/range.py", line 349, in get_loc
  raise KeyError(key)
  KeyError: 'Strand'
  [Thu Oct 26 16:25:53 2023]
  Error in rule TSS_Enrichment:
    jobid: 0
  output: tss/hESC_rep1_tss.csv, tss/hESC_rep1_score.csv

  RuleException:
    CalledProcessError in line 242 of /home/cisDynet_snakemake/atacqc/cisDynet_snakemake.py:
    Command 'set -euo pipefail;  **python /home/cisDynet_snakemake/atacqc/cal_tss.py cut_sites/hESC_rep1_q30_cut_sites.bed tss/TSS.pos tss/hESC_rep1_tss.csv tss/hESC_rep1_score.csv' returned non-zero exit status 1.**
  File "/home/systools/anaconda3/envs/cisDynet_pipeline/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 2326, in run_wrapper
  File "/home/cisDynet_snakemake/atacqc/cisDynet_snakemake.py", line 242, in __rule_TSS_Enrichment
  File "/home/systools/anaconda3/envs/cisDynet_pipeline/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 568, in _callback
  File "/home/systools/anaconda3/envs/cisDynet_pipeline/lib/python3.8/concurrent/futures/thread.py", line 57, in run
  File "/home/systools/anaconda3/envs/cisDynet_pipeline/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 554, in cached_or_run
  File "/home/systools/anaconda3/envs/cisDynet_pipeline/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 2357, in run_wrapper
  Exiting because a job execution failed. Look above for error message

I think there may be some problem during python /home/cisDynet_snakemake/atacqc/cal_tss.py cut_sites/hESC_rep1_q30_cut_sites.bed tss/TSS.pos tss/hESC_rep1_tss.csv tss/hESC_rep1_score.csv . when I tried to run it separately, it report error as below:

(cisDynet_pipeline) python /home/storage_4/Projects/Gerline_database/ATAC-Seq/human/ATAC_integration/cisDynet_snakemake/atacqc/cal_tss.py cut_sites/hESC_rep1_q30_cut_sites.bed tss/TSS.pos test/hESC_rep1_tss.csv test/hESC_rep1_score.csv

join: Strand data from other will be added as strand data to self.
If this is undesired use the flag apply_strand_suffix=False.
To turn off the warning set apply_strand_suffix to True or False.
Traceback (most recent call last):
  File "/home/storage_4/Projects/Gerline_database/ATAC-Seq/human/ATAC_integration/cisDynet_snakemake/atacqc/cal_tss.py", line 17, in <module>
    dis_fliter_for=dis[dis['Strand']=='+']
  File "/home/systools/anaconda3/envs/cisDynet_pipeline/lib/python3.8/site-packages/pandas/core/frame.py", line 3761, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/systools/anaconda3/envs/cisDynet_pipeline/lib/python3.8/site-packages/pandas/core/indexes/range.py", line 349, in get_loc
    raise KeyError(key)
KeyError: 'Strand'

Could you please give some advice?

tzhu-bio commented 8 months ago

I had a similar error reported once, it was due to a pandas version update, you can test it first with python /home/cisDynet_snakemake/atacqc/cal_tss_old.py cut_sites/hESC_rep1_q30_cut_sites.bed tss/TSS.pos tss/hESC_rep1_tss.csv tss/hESC_rep1_score.csv alone.

By the way, you'd better re-download atacqc as I updated some of the scripts today.

yashi99 commented 8 months ago

Thank you! I tried using the cal_tss_old.py but it returned the same error:

python /home/cisDynet_snakemake/atacqc/cal_tss_old.py cut_sites/hESC_rep1_q30_cut_sites.bed tss/TSS.pos test/hESC_rep1_tss.csv test/hESC_rep1_score.csv
join: Strand data from other will be added as strand data to self.
If this is undesired use the flag apply_strand_suffix=False.
To turn off the warning set apply_strand_suffix to True or False.
Traceback (most recent call last):
  File "/home/cisDynet_snakemake/atacqc/cal_tss_old.py", line 17, in <module>
    dis_fliter_for=dis[dis['Strand']=='+']
  File "/home/systools/anaconda3/envs/cisDynet_pipeline/lib/python3.8/site-packages/pandas/core/frame.py", line 3761, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/systools/anaconda3/envs/cisDynet_pipeline/lib/python3.8/site-packages/pandas/core/indexes/range.py", line 349, in get_loc
    raise KeyError(key)
KeyError: 'Strand'

And then I tried the cal_tss_old.py and cal_tss.py from the re-downloaded atacqc. They did not work with the same errors. Then I just run the re-downloaded cisDynet_snakemake.py in a new folder, and got the error:

snakemake -s ../cisDynet_snakemake2/atacqc/cisDynet_snakemake.py -j 15

SyntaxError in line 75 of /home/cisDynet_snakemake2/atacqc/cisDynet_snakemake.py:
invalid syntax

Maybe I should first solve the problem of cal_tss.py or cal_tss_old.py?

tzhu-bio commented 8 months ago

Can you print a few lines from the TSS.pos file in the tss directory (created by this pipeline)? Also, I just updated cisDynet_snakemake.py.

yashi99 commented 8 months ago

Of course!

(cisDynet_pipeline) head -n 50 TSS.pos J01415.2 577 577 1 + gene J01415.2 648 648 1 + gene J01415.2 1602 1602 1 + gene J01415.2 1671 1671 1 + gene J01415.2 3230 3230 1 + gene J01415.2 3307 3307 1 + gene J01415.2 4263 4263 1 + gene J01415.2 4400 4400 1 - gene J01415.2 4402 4402 1 + gene J01415.2 4470 4470 1 + gene J01415.2 5512 5512 1 + gene J01415.2 5655 5655 1 - gene J01415.2 5729 5729 1 - gene J01415.2 5826 5826 1 - gene J01415.2 5891 5891 1 - gene J01415.2 5904 5904 1 + gene J01415.2 7514 7514 1 - gene J01415.2 7518 7518 1 + gene J01415.2 7586 7586 1 + gene J01415.2 8295 8295 1 + gene J01415.2 8366 8366 1 + gene J01415.2 8527 8527 1 + gene J01415.2 9207 9207 1 + gene J01415.2 9991 9991 1 + gene J01415.2 10059 10059 1 + gene

tzhu-bio commented 8 months ago

Can you run the following code and then take a screenshot to show me the output?

import pyranges as pr
import pandas as pd
tss_file=pd.read_table('tss/TSS.pos',usecols=[0,1,2,4],skiprows=1,names='Chromosome Start End Strand'.split())
tss_gr=pr.PyRanges(tss_file)
cut=pd.read_table('cut_sites/hESC_rep1_q30_cut_sites.bed', usecols=[0,1,2], names='Chromosome Start End'.split())
cut_gr=pr.PyRanges(cut)
dis=cut_gr.nearest(tss_gr).as_df()
dis.head()

yashi99 commented 8 months ago

Yes, and here are the results:

>>> import pyranges as pr
>>> import pandas as pd
>>> tss_file=pd.read_table('tss/TSS.pos',usecols=[0,1,2,4],skiprows=1,names='Chromosome Start End Strand'.split())
>>> tss_gr=pr.PyRanges(tss_file)
>>> cut=pd.read_table('cut_sites/hESC_rep1_q30_cut_sites.bed', usecols=[0,1,2], names='Chromosome Start End'.split())
>>> cut_gr=pr.PyRanges(cut)
>>> dis=cut_gr.nearest(tss_gr).as_df()
join: Strand data from other will be added as strand data to self.
If this is undesired use the flag apply_strand_suffix=False.
To turn off the warning set apply_strand_suffix to True or False.
>>> dis.head()
Empty DataFrame
Columns: []
Index: []
>>>

tzhu-bio commented 8 months ago

Can you show a few lines of your cut_sites/hESC_rep1_q30_cut_sites.bed?

yashi99 commented 8 months ago

Yes:

(cisDynet_pipeline) head cut_sites/hESC_rep1_q30_cut_sites.bed 
chr1    10007   10008   1
chr1    10124   10125   1
chr1    10030   10031   1
chr1    10047   10048   1
chr1    10113   10114   1
chr1    10049   10050   1
chr1    10138   10139   1
chr1    10138   10139   1
chr1    10138   10139   1
chr1    17207   17208   1

yashi99 commented 8 months ago

And here are some head lines of each file


>>> tss_gr
+--------------+-----------+-----------+--------------+
| Chromosome   | Start     | End       | Strand       |
| (category)   | (int64)   | (int64)   | (category)   |
|--------------+-----------+-----------+--------------|
| J01415.2     | 648       | 648       | +            |
| J01415.2     | 1602      | 1602      | +            |
| J01415.2     | 1671      | 1671      | +            |
| J01415.2     | 3230      | 3230      | +            |
| ...          | ...       | ...       | ...          |
| J01415.2     | 7514      | 7514      | -            |
| J01415.2     | 14673     | 14673     | -            |
| J01415.2     | 14742     | 14742     | -            |
| J01415.2     | 16023     | 16023     | -            |
+--------------+-----------+-----------+--------------+
Stranded PyRanges object has 36 rows and 4 columns from 1 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
>>> cut
                    Chromosome   Start     End
0                         chr1   10007   10008
1                         chr1   10124   10125
2                         chr1   10030   10031
3                         chr1   10047   10048
4                         chr1   10113   10114
...                        ...     ...     ...
18638934  chr19_GL949752v1_alt  204499  204500
18638935  chr19_GL949752v1_alt  204595  204596
18638936  chr19_GL949753v2_alt  257133  257134
18638937  chr19_GL949753v2_alt  257164  257165
18638938  chr19_KI270938v1_alt  332376  332377

[18638939 rows x 3 columns]
>>> dis
Empty DataFrame
Columns: []
Index: []
>>> cut_gr.nearest(tss_gr)
join: Strand data from other will be added as strand data to self.
If this is undesired use the flag apply_strand_suffix=False.
To turn off the warning set apply_strand_suffix to True or False.
Empty PyRanges
>>> cut_gr
+------------------------+-----------+-----------+
| Chromosome             | Start     | End       |
| (category)             | (int64)   | (int64)   |
|------------------------+-----------+-----------|
| chr1                   | 10007     | 10008     |
| chr1                   | 10124     | 10125     |
| chr1                   | 10030     | 10031     |
| chr1                   | 10047     | 10048     |
| ...                    | ...       | ...       |
| chrY_KI270740v1_random | 18205     | 18206     |
| chrY_KI270740v1_random | 18419     | 18420     |
| chrY_KI270740v1_random | 20526     | 20527     |
| chrY_KI270740v1_random | 20711     | 20712     |
+------------------------+-----------+-----------+
Unstranded PyRanges object has 18,638,939 rows and 3 columns from 296 chromosomes.
For printing, the PyRanges was sorted on Chromosome.

tzhu-bio commented 8 months ago

Is J01415.2 a chromosome number?发自我的 iPhone在 2023年10月27日，14:12，yashi99 @.***> 写道： Yes: (cisDynet_pipeline) head cut_sites/hESC_rep1_q30_cut_sites.bed chr1 10007 10008 1 chr1 10124 10125 1 chr1 10030 10031 1 chr1 10047 10048 1 chr1 10113 10114 1 chr1 10049 10050 1 chr1 10138 10139 1 chr1 10138 10139 1 chr1 10138 10139 1 chr1 17207 17208 1

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you commented.Message ID: @.***>

yashi99 commented 8 months ago

No, I am not sure why it shows JJ01415.2 in TSS.pos，so I search "J01415.2" in the .fna and .gff recorded in config.yaml:

(base) cat /home/data/genome/BWA_RNA_UCSCfa_hg38_index/GCA_000001405.15_GRCh38_full_analysis_set.fna | grep J01415
>chrM  AC:J01415.2  gi:113200490  LN:16569  rl:Mitochondrion  M5:c68f52674c9fb33aef52dcf399755519  AS:GRCh38  tp:circular

(base) cat /home/data/genome/BWA_RNA_UCSCfa_hg38_index/GCA_000001405.15_GRCh38_genomic.gff | grep J01415.2 | head
##sequence-region J01415.2 1 16569
J01415.2    Genbank region  1   16569   .   +   .   ID=id479;Dbxref=taxon:9606;Is_circular=true;Name=MT;country=United Kingdom: Great Britain;gbkey=Src;genome=mitochondrion;isolation-source=caucasian;mol_type=genomic DNA;note=this is the rCRS;tissue-type=placenta
J01415.2    Genbank D_loop  16024   17145   .   -   .   ID=id480;gbkey=D-loop
J01415.2    Genbank gene    577 647 .   +   .   ID=gene0;Dbxref=GeneID:4558;Name=TRNF;gbkey=Gene;gene=TRNF;gene_synonym=MTTF
J01415.2    Genbank tRNA    577 647 .   +   .   ID=rna0;Parent=gene0;Note=NAR: 1455;anticodon=(pos:611..613);codons=1;gbkey=tRNA;gene=TRNF;product=tRNA-Phe
J01415.2    Genbank exon    577 647 .   +   .   ID=id481;Parent=rna0;Note=NAR: 1455;anticodon=(pos:611..613);codons=1;gbkey=tRNA;gene=TRNF;product=tRNA-Phe
J01415.2    Genbank gene    648 1601    .   +   .   ID=gene1;Dbxref=GeneID:4549;Name=RNR1;gbkey=Gene;gene=RNR1;gene_synonym=MTRNR1
J01415.2    Genbank rRNA    648 1601    .   +   .   ID=rna1;Parent=gene1;gbkey=rRNA;gene=RNR1;product=12S ribosomal RNA
J01415.2    Genbank exon    648 1601    .   +   .   ID=id482;Parent=rna1;gbkey=rRNA;gene=RNR1;product=12S ribosomal RNA
J01415.2    Genbank gene    1602    1670    .   +   .   ID=gene2;Dbxref=GeneID:4577;Name=TRNV;gbkey=Gene;gene=TRNV;gene_synonym=MTTV

It seems that those fna and gff downloaded from UCSC are questionable, so I will firstly turn to use my usual fa and gtf to build bwa index to run snakemake. Also I notice that in this .fna，J01415.2 seems to be another name of chrM？Is it because I set "organell_chr: chrM" in my config.yaml? I am still not sure what this parameter means. Is it means "only retain chrM results" or "only discard chrM results"?

tzhu-bio commented 8 months ago

Please send me part of the data (10000 rows of cut_sites/hESC_rep1_q30_cut_sites.bed ) and your gff file to my email (tzhubio@gmail.com) and I'll take a look at it specifically.

yashi99 commented 8 months ago

I have sent you the questionable files, and then I tried to run snakemake using my usual .fa and .gtf as mentioned before, and then I successfully got the results! Thank you so much!

tzhu-bio / cisDynet_snakemake

Error when running cisDynet_snakemake/atacqc/cal_tss.py #3