sjroth / ARTDeco

MIT License

KeyError: '1' when running preprocessing mode #12

Closed PacoRM24 closed 1 year ago

PacoRM24 commented 1 year ago

Hello, I'm running the preprocessing mode using the following command:

ARTDeco -mode preprocess -gtf-file $GTF_FILE -chrom-sizes-file $CHROM_SIZES -layout PE -stranded True -orientation Forward -meta-file $META_FILE -comparisons-file $COMPARISONS_FILE

And I'm getting the following error:

Running preprocess mode...
Loading ARTDeco file structure...
Reformatted meta file exists...
Reformatted comparisons file exists...
ARTDeco will generate the following files:
./preprocess_files/readthrough.bed
./preprocess_files/Pancreatic_ControlR1
./preprocess_files/Pancreatic_THZ1R1
./preprocess_files/Pancreatic_THZ1R2
./preprocess_files/Pancreatic_ControlR2
./preprocess_files/Pancreatic_TPLR1
./preprocess_files/read_in.bed
./preprocess_files/Pancreatic_TPLR2
BAM file format needed... Checking... Will infer if not user-specified.
BAM files specified as paired-end...
BAM files specified as stranded...
BAM files specified as forward-strand oriented...
Summarizing BAM file stats...
/share/apps/External/Python-3.6.6/lib/python3.6/site-packages/rpy2/robjects/pandas2ri.py:14: FutureWarning: pandas.core.index is deprecated and will be removed in a future version. The public classes are available in the top-level namespace.
  from pandas.core.index import Index as PandasIndex
6 Experiments
Files are Paired-End, Strand-Specific, Forward-strand oriented
Experiment                    Total Reads   Mapped Reads
./Pancreatic_ControlR2.bam    101003393     94430201
./Pancreatic_THZ1R1.bam       91210662      82841722
./Pancreatic_TPLR2.bam        156674434     143653997
./Pancreatic_THZ1R2.bam       141429589     129674323
./Pancreatic_TPLR1.bam        154332476     141352258
./Pancreatic_ControlR1.bam    85466746      80266973
Generating read-in region BED file...
Traceback (most recent call last):
  File "/share/apps/External/Python-3.6.6/bin/ARTDeco", line 11, in
    load_entry_point('ARTDeco==0.4', 'console_scripts', 'ARTDeco')()
  File "/share/apps/External/Python-3.6.6/lib/python3.6/site-packages/ARTDeco-0.4-py3.6.egg/ARTDeco/main.py", line 424, in main
  File "/share/apps/External/Python-3.6.6/lib/python3.6/site-packages/ARTDeco-0.4-py3.6.egg/ARTDeco/preprocess.py", line 273, in create_stranded_read_in_df
  File "/share/apps/External/Python-3.6.6/lib/python3.6/site-packages/ARTDeco-0.4-py3.6.egg/ARTDeco/preprocess.py", line 174, in format_read_in_df
  File "/share/apps/External/Python-3.6.6/lib/python3.6/site-packages/pandas/core/frame.py", line 7552, in apply
    return op.get_result()
  File "/share/apps/External/Python-3.6.6/lib/python3.6/site-packages/pandas/core/apply.py", line 185, in get_result
    return self.apply_standard()
  File "/share/apps/External/Python-3.6.6/lib/python3.6/site-packages/pandas/core/apply.py", line 276, in apply_standard
    results, res_index = self.apply_series_generator()
  File "/share/apps/External/Python-3.6.6/lib/python3.6/site-packages/pandas/core/apply.py", line 305, in apply_series_generator
    results[i] = self.f(v)
  File "/share/apps/External/Python-3.6.6/lib/python3.6/site-packages/ARTDeco-0.4-py3.6.egg/ARTDeco/preprocess.py", line 174, in
KeyError: '1'

Before this, I tried the following command:

ARTDeco -mode preprocess -gtf-file $GTF_FILE -chrom-sizes-file $CHROM_SIZES -layout PE -stranded True -read-in-dist 1 -readthrough-dist 5 -intergenic-min-len 100 -intergenic-max-len 15 -meta-file $META_FILE -comparisons-file $COMPARISONS_FILE

But I got the following error:

...
Files are Paired-End, Strand-Specific, Forward-strand oriented
Experiment                    Total Reads   Mapped Reads
./Pancreatic_ControlR2.bam    101003393     94430201
./Pancreatic_THZ1R1.bam       91210662      82841722
./Pancreatic_TPLR2.bam        156674434     143653997
./Pancreatic_THZ1R2.bam       141429589     129674323
./Pancreatic_TPLR1.bam        154332476     141352258
./Pancreatic_ControlR1.bam    85466746      80266973
Generating read-in region BED file...
Traceback (most recent call last):
  File "/share/apps/External/Python-3.6.6/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2898, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Max Len'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/share/apps/External/Python-3.6.6/lib/python3.6/site-packages/pandas/core/generic.py", line 3576, in _set_item
    loc = self._info_axis.get_loc(key)
  File "/share/apps/External/Python-3.6.6/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2900, in get_loc
    raise KeyError(key) from err
KeyError: 'Max Len'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/share/apps/External/Python-3.6.6/bin/ARTDeco", line 11, in
    load_entry_point('ARTDeco==0.4', 'console_scripts', 'ARTDeco')()
  File "/share/apps/External/Python-3.6.6/lib/python3.6/site-packages/ARTDeco-0.4-py3.6.egg/ARTDeco/main.py", line 424, in main
  File "/share/apps/External/Python-3.6.6/lib/python3.6/site-packages/ARTDeco-0.4-py3.6.egg/ARTDeco/preprocess.py", line 273, in create_stranded_read_in_df
  File "/share/apps/External/Python-3.6.6/lib/python3.6/site-packages/ARTDeco-0.4-py3.6.egg/ARTDeco/preprocess.py", line 174, in format_read_in_df
  File "/share/apps/External/Python-3.6.6/lib/python3.6/site-packages/pandas/core/frame.py", line 3044, in __setitem__
    self._set_item(key, value)
  File "/share/apps/External/Python-3.6.6/lib/python3.6/site-packages/pandas/core/frame.py", line 3121, in _set_item
    NDFrame._set_item(self, key, value)
  File "/share/apps/External/Python-3.6.6/lib/python3.6/site-packages/pandas/core/generic.py", line 3579, in _set_item
    self._mgr.insert(len(self._info_axis), key, value)
  File "/share/apps/External/Python-3.6.6/lib/python3.6/site-packages/pandas/core/internals/managers.py", line 1198, in insert
    block = make_block(values=value, ndim=self.ndim, placement=slice(loc, loc + 1))
  File "/share/apps/External/Python-3.6.6/lib/python3.6/site-packages/pandas/core/internals/blocks.py", line 2744, in make_block
    return klass(values, ndim=ndim, placement=placement)
  File "/share/apps/External/Python-3.6.6/lib/python3.6/site-packages/pandas/core/internals/blocks.py", line 2400, in __init__
    super().__init__(values, ndim=ndim, placement=placement)
  File "/share/apps/External/Python-3.6.6/lib/python3.6/site-packages/pandas/core/internals/blocks.py", line 131, in __init__
    f"Wrong number of items passed {len(self.values)}, "
ValueError: Wrong number of items passed 5, placement implies 1

I don't know what I can do to fix it. Can you help me please?

sjroth commented 1 year ago

Okay. Let's try to solve this one command at a time. Can you attach your chromosome sizes file?

Also, are you running this in the ARTDeco conda environment?
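
For readers unsure how to answer the conda question, here is a minimal sketch of one way to check from inside Python. It assumes that an activated conda environment sets the CONDA_PREFIX and CONDA_DEFAULT_ENV variables, which conda does on activation. The function name conda_environment_info is hypothetical and is not part of ARTDeco.

```python
import os
import sys


def conda_environment_info(environ=None, prefix=None):
    """Report whether the running Python appears to come from an active conda env.

    `environ` and `prefix` default to the real process environment and
    interpreter prefix; they are parameters only so the check can be tested.
    """
    environ = os.environ if environ is None else environ
    prefix = sys.prefix if prefix is None else prefix
    conda_prefix = environ.get("CONDA_PREFIX")
    return {
        "python_prefix": prefix,
        "conda_env_name": environ.get("CONDA_DEFAULT_ENV"),
        "conda_prefix": conda_prefix,
        # True only if an env is active AND this interpreter lives inside it.
        "python_is_from_conda_env": bool(conda_prefix)
        and os.path.realpath(prefix) == os.path.realpath(conda_prefix),
    }


if __name__ == "__main__":
    for key, value in conda_environment_info().items():
        print(f"{key}: {value}")
```

If python_is_from_conda_env is False, the interpreter on the PATH is not the one from an activated conda environment. This says nothing about whether the required tools are installed some other way.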

PacoRM24 commented 1 year ago

Here is the chromosome sizes file that I'm using:

chr1 248956422 chr2 242193529 chr3 198295559 chr4 190214555 chr5 181538259 chr6 170805979 chr7 159345973 chr8 145138636 chr9 138394717 chr10 133797422 chr11 135086622 chr12 133275309 chr13 114364328 chr14 107043718 chr15 101991189 chr16 90338345 chr17 83257441 chr18 80373285 chr19 58617616 chr20 64444167 chr21 46709983 chr22 50818468 chrX 156040895 chrY 57227415 chrM 16569 GL000008.2 209709 GL000009.2 201709 GL000194.1 191469 GL000195.1 182896 GL000205.2 185591 GL000208.1 92689 GL000213.1 164239 GL000214.1 137718 GL000216.2 176608 GL000218.1 161147 GL000219.1 179198 GL000220.1 161802 GL000221.1 155397 GL000224.1 179693 GL000225.1 211173 GL000226.1 15008 KI270302.1 2274 KI270303.1 1942 KI270304.1 2165 KI270305.1 1472 KI270310.1 1201 KI270311.1 12399 KI270312.1 998 KI270315.1 2276 KI270316.1 1444 KI270317.1 37690 KI270320.1 4416 KI270322.1 21476 KI270329.1 1040 KI270330.1 1652 KI270333.1 2699 KI270334.1 1368 KI270335.1 1048 KI270336.1 1026 KI270337.1 1121 KI270338.1 1428 KI270340.1 1428 KI270362.1 3530 KI270363.1 1803 KI270364.1 2855 KI270366.1 8320 KI270371.1 2805 KI270372.1 1650 KI270373.1 1451 KI270374.1 2656 KI270375.1 2378 KI270376.1 1136 KI270378.1 1048 KI270379.1 1045 KI270381.1 1930 KI270382.1 4215 KI270383.1 1750 KI270384.1 1658 KI270385.1 990 KI270386.1 1788 KI270387.1 1537 KI270388.1 1216 KI270389.1 1298 KI270390.1 2387 KI270391.1 1484 KI270392.1 971 KI270393.1 1308 KI270394.1 970 KI270395.1 1143 KI270396.1 1880 KI270411.1 2646 KI270412.1 1179 KI270414.1 2489 KI270417.1 2043 KI270418.1 2145 KI270419.1 1029 KI270420.1 2321 KI270422.1 1445 KI270423.1 981 KI270424.1 2140 KI270425.1 1884 KI270429.1 1361 KI270435.1 92983 KI270438.1 112505 KI270442.1 392061 KI270448.1 7992 KI270465.1 1774 KI270466.1 1233 KI270467.1 3920 KI270468.1 4055 KI270507.1 5353 KI270508.1 1951 KI270509.1 2318 KI270510.1 2415 KI270511.1 8127 KI270512.1 22689 KI270515.1 6361 KI270516.1 1300 KI270517.1 3253 KI270518.1 2186 KI270519.1 138126 KI270521.1 7642 KI270522.1 5674 KI270528.1 2983 
KI270529.1 1899 KI270530.1 2168 KI270538.1 91309 KI270539.1 993 KI270544.1 1202 KI270548.1 1599 KI270579.1 31033 KI270580.1 1553 KI270581.1 7046 KI270582.1 6504 KI270583.1 1400 KI270584.1 4513 KI270587.1 2969 KI270588.1 6158 KI270589.1 44474 KI270590.1 4685 KI270591.1 5796 KI270593.1 3041 KI270706.1 175055 KI270707.1 32032 KI270708.1 127682 KI270709.1 66860 KI270710.1 40176 KI270711.1 42210 KI270712.1 176043 KI270713.1 40745 KI270714.1 41717 KI270715.1 161471 KI270716.1 153799 KI270717.1 40062 KI270718.1 38054 KI270719.1 176845 KI270720.1 39050 KI270721.1 100316 KI270722.1 194050 KI270723.1 38115 KI270724.1 39555 KI270725.1 172810 KI270726.1 43739 KI270727.1 448248 KI270728.1 1872759 KI270729.1 280839 KI270730.1 112551 KI270731.1 150754 KI270732.1 41543 KI270733.1 179772 KI270734.1 165050 KI270735.1 42811 KI270736.1 181920 KI270737.1 103838 KI270738.1 99375 KI270739.1 73985 KI270740.1 37240 KI270741.1 157432 KI270742.1 186739 KI270743.1 210658 KI270744.1 168472 KI270745.1 41891 KI270746.1 66486 KI270747.1 198735 KI270748.1 93321 KI270749.1 158759 KI270750.1 148850 KI270751.1 150742 KI270752.1 27745 KI270753.1 62944 KI270754.1 40191 KI270755.1 36723 KI270756.1 79590 KI270757.1 71251

I'm running ARTDeco on a cluster. I load it as a module, as shown in the following command:

module load programs/artdeco

sjroth commented 1 year ago

Ah I think I know the problem! Your GTF file and your chromosome sizes file likely have different conventions for the chromosome names. To verify this, I want you to show me the output of the following command (where GTF_FILE is your GTF):

cut -f1 GTF_FILE | sort | uniq

"I'm running this ARTDeco in a cluster. I call it by using a module as shown in the following command: module load programs/artdeco" - This does not answer whether you are using the constructed Conda environment included in the repo.

N.B. You can attach files rather than copying their content so please do that in the future. The more easily I can read your comment, the more easily I can help.
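
The check requested above can also be done in one step by comparing the names in both files directly. The following is a rough sketch, not ARTDeco code: the function names are hypothetical, and the in-memory GTF rows are made-up toy records that only illustrate the layout (chromosome name in the first tab-separated column, header lines starting with "#").

```python
import io


def gtf_chromosomes(handle):
    """Collect the distinct values of column 1 of a GTF, skipping '#' headers."""
    names = set()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


def chrom_sizes_names(handle):
    """Collect the chromosome names (first field) of a chrom sizes file."""
    names = set()
    for line in handle:
        fields = line.split()
        if fields:
            names.add(fields[0])
    return names


def compare_names(gtf_names, size_names):
    """Summarize which names are shared and which appear in only one file."""
    return {
        "shared": sorted(gtf_names & size_names),
        "only_in_gtf": sorted(gtf_names - size_names),
        "only_in_chrom_sizes": sorted(size_names - gtf_names),
    }


# Toy inputs (hypothetical records) standing in for the real files.
gtf_text = (
    "#!genome-version GRCh38\n"
    '1\ttoy\tgene\t100\t200\t.\t+\t.\tgene_id "G1";\n'
    'MT\ttoy\tgene\t100\t200\t.\t+\t.\tgene_id "G2";\n'
    'GL000009.2\ttoy\tgene\t100\t200\t.\t-\t.\tgene_id "G3";\n'
)
sizes_text = "chr1\t248956422\nchrM\t16569\nGL000009.2\t201709\n"

report = compare_names(
    gtf_chromosomes(io.StringIO(gtf_text)),
    chrom_sizes_names(io.StringIO(sizes_text)),
)
for key, names in report.items():
    print(key, names)
```

With real files, open(path) handles can be passed in place of the io.StringIO objects. Any entry under only_in_gtf is a name that a lookup in the chrom sizes file would not find.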

PacoRM24 commented 1 year ago

Sorry, I'm just a beginner in programming. I'm not quite sure whether loading ARTDeco as a module also activates the Conda environment. How can I check this?

Here is the output of the command you requested.

1 10 11 12 13 14 15 16 17 18 19 2 20 21 22 3 4 5 6 7 8 9

!genebuild-last-updated 2022-04 transcript_id "";

!genome-build-accession GCA_000001405.28 transcript_id "";

!genome-build GRCh38.p13 transcript_id "";

!genome-date 2013-12 transcript_id "";

!genome-version GRCh38 transcript_id "";

GL000009.2 GL000194.1 GL000195.1 GL000205.2 GL000213.1 GL000216.2 GL000218.1 GL000219.1 GL000220.1 GL000225.1 KI270442.1 KI270711.1 KI270713.1 KI270721.1 KI270726.1 KI270727.1 KI270728.1 KI270731.1 KI270733.1 KI270734.1 KI270744.1 KI270750.1 MT X Y

sjroth commented 1 year ago

"Sorry, I'm just a beginner in programming. I'm not quite sure whether loading ARTDeco as a module also activates the Conda environment. How can I check this?" - Did you create the Conda environment? Let's start there.

As expected, your GTF chromosomes do not match your genome. You need to correct this before moving forward by supplying a GTF with a matching set of chromosomes.
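
To make the mismatch concrete: the GTF output above uses names such as 1 and MT, while the chromosome sizes file uses chr1 and chrM. The sketch below is an illustration under assumptions, not ARTDeco's actual code. It shows how a size lookup keyed by one convention fails for the other, and one possible way to rewrite the chromosome sizes names to the GTF's convention. All function names are hypothetical, and the alternative fix suggested above (a GTF with matching names) works equally well.

```python
def to_gtf_style_name(name):
    """Map a 'chr'-prefixed name (chr1, chrM) to the style in the GTF (1, MT)."""
    if name == "chrM":
        return "MT"
    if name.startswith("chr"):
        return name[3:]
    return name  # scaffolds such as GL000008.2 already match


def convert_chrom_sizes(lines):
    """Rewrite 'name size' lines with GTF-style names, tab-separated."""
    converted = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            converted.append(f"{to_gtf_style_name(fields[0])}\t{fields[1]}")
    return converted


def load_sizes(lines):
    """Build a {chromosome: length} lookup from 'name size' lines."""
    split_lines = (line.split() for line in lines)
    return {f[0]: int(f[1]) for f in split_lines if len(f) >= 2}


# A few entries copied from the chromosome sizes file posted in this thread.
original = ["chr1 248956422", "chrX 156040895", "chrM 16569", "GL000008.2 209709"]

# Looking up a GTF-style name in the unconverted table fails.
failure = None
try:
    load_sizes(original)["1"]
except KeyError as err:
    failure = f"KeyError: {err}"

converted = convert_chrom_sizes(original)
sizes = load_sizes(converted)

print(failure)
print(converted)
print(sizes["1"], sizes["MT"])
```

Which convention the BAM files use is not stated in the thread, so the sketch simply targets the names seen in the GTF output.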

PacoRM24 commented 1 year ago

"Did you create the Conda environment? Let's start there." - The cluster administrator kindly installed ARTDeco on the cluster as a module. When I load ARTDeco with the command "module load programs/artdeco", other modules are loaded too (python-3.6.6, bedops-2.4.40, R-4.1.2, homer-4.10, samtools-1.9). I understood that this creates an environment to work with ARTDeco. Am I wrong?

"As expected, your GTF chromosomes do not match your genome. You need to correct this before moving forward by supplying a GTF with a matching set of chromosomes." - Thank you, I will correct it.

sjroth commented 1 year ago

"The cluster administrator kindly installed ARTDeco on the cluster as a module. When I load ARTDeco with the command "module load programs/artdeco", other modules are loaded too (python-3.6.6, bedops-2.4.40, R-4.1.2, homer-4.10, samtools-1.9). I understood that this creates an environment to work with ARTDeco. Am I wrong?" - It sounds like they probably installed it with the conda environment.

PacoRM24 commented 1 year ago

"As expected, your GTF chromosomes do not match your genome. You need to correct this before moving forward by supplying a GTF with a matching set of chromosomes." - You were right. I have run the preprocessing mode successfully by changing the chromosome sizes file. Thank you very much!