sjroth / ARTDeco

MIT License
15 stars 7 forks

Generating read-in region BED file error #26

Closed mrb20045 closed 1 month ago

mrb20045 commented 2 months ago

Hi, I ran the tool and got the error "ValueError: Cannot set a DataFrame with multiple columns to the single column Max Len". The complete log is as follows.

/home/linuxbrew/.linuxbrew/opt/python@3.9/lib/python3.9/site-packages/rpy2-3.0.0-py3.9.egg/rpy2/robjects/pandas2ri.py:15: FutureWarning: pandas.core.index is deprecated and will be removed in a future version. The public classes are available in the top-level namespace.
/home/linuxbrew/.linuxbrew/opt/python@3.9/lib/python3.9/site-packages/ARTDeco-0.4-py3.9.egg/ARTDeco/preprocess.py:349: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
Running preprocess mode...
Loading ARTDeco file structure...
Meta file properly formatted...
Generating reformatted meta...
Comparisons file exists...
Comparisons file properly formatted...
Generating reformatted comparisons...
ARTDeco will generate the following files:
./preprocess_files/genes.full.bed
./preprocess_files/Sample_8
./preprocess_files/Sample_5
./preprocess_files/Sample_6
./preprocess_files/Sample_11
./preprocess_files/gene_types.txt
./preprocess_files/Sample_2
./preprocess_files/Sample_7
./preprocess_files/Sample_1
./preprocess_files/Sample_12
./preprocess_files/readthrough.bed
./preprocess_files/Sample_4
./preprocess_files/Sample_9
./preprocess_files/Sample_10
./preprocess_files/genes_condensed.bed
./preprocess_files/Sample_3
./preprocess_files/read_in.bed
./preprocess_files/gene_to_transcript.txt
GTF file needed... Checking...
GTF file exists...
BAM file format needed... Checking... Will infer if not user-specified.
BAM files specified as paired-end...
BAM files specified as unstranded...
No strand orientation specified... Data is unstranded... No need to infer orientation...
Skipping summary of BAM file stats...
Convert GTF to BED...
Generating condensed genes bed...
Generating read-in region BED file...
Traceback (most recent call last):
  File "/home/linuxbrew/.linuxbrew/bin/ARTDeco", line 33, in <module>
    sys.exit(load_entry_point('ARTDeco==0.4', 'console_scripts', 'ARTDeco')())
  File "/home/linuxbrew/.linuxbrew/opt/python@3.9/lib/python3.9/site-packages/ARTDeco-0.4-py3.9.egg/ARTDeco/main.py", line 426, in main
  File "/home/linuxbrew/.linuxbrew/opt/python@3.9/lib/python3.9/site-packages/ARTDeco-0.4-py3.9.egg/ARTDeco/preprocess.py", line 356, in create_unstranded_read_in_df
  File "/home/linuxbrew/.linuxbrew/opt/python@3.9/lib/python3.9/site-packages/ARTDeco-0.4-py3.9.egg/ARTDeco/preprocess.py", line 174, in format_read_in_df
  File "/home/linuxbrew/.linuxbrew/opt/python@3.9/lib/python3.9/site-packages/pandas/core/frame.py", line 3970, in __setitem__
    self._set_item_frame_value(key, value)
  File "/home/linuxbrew/.linuxbrew/opt/python@3.9/lib/python3.9/site-packages/pandas/core/frame.py", line 4125, in _set_item_frame_value
    raise ValueError(
ValueError: Cannot set a DataFrame with multiple columns to the single column Max Len

sjroth commented 2 months ago

Hi,

Can you provide the GTF file?

Best, Sam

mrb20045 commented 2 months ago

I used an Ensembl GTF file and, to make it work with your code, converted it with the command you provide: "awk '{ if ($0 ~ "transcript_id") print $0; else print $0" transcript_id \"\";"; }'". The head of my converted GTF file is shown below. It is worth noting that gtf2bed works on this GTF file.

#!genome-build ARS-UCD1.2 transcript_id "";
#!genome-version ARS-UCD1.2 transcript_id "";
#!genome-date 2018-04 transcript_id "";
#!genome-build-accession GCA_002263795.2 transcript_id "";
#!genebuild-last-updated 2018-11 transcript_id "";
1 ensembl gene 339070 350389 . - . gene_id "ENSBTAG00000006648"; gene_version "6"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_id "";
1 ensembl transcript 339070 350389 . - . gene_id "ENSBTAG00000006648"; gene_version "6"; transcript_id "ENSBTAT00000008737"; transcript_version "6"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; tag "Ensembl_canonical";
1 ensembl exon 350267 350389 . - . gene_id "ENSBTAG00000006648"; gene_version "6"; transcript_id "ENSBTAT00000008737"; transcript_version "6"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSBTAE00000512015"; exon_version "1"; tag "Ensembl_canonical";
1 ensembl CDS 350267 350389 . - 0 gene_id "ENSBTAG00000006648"; gene_version "6"; transcript_id "ENSBTAT00000008737"; transcript_version "6"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSBTAP00000008737"; protein_version "6"; tag "Ensembl_canonical";
1 ensembl start_codon 350387 350389 . - 0 gene_id "ENSBTAG00000006648"; gene_version "6"; transcript_id "ENSBTAT00000008737"; transcript_version "6"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; tag "Ensembl_canonical";
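The awk command quoted above appends an empty transcript_id to every line that lacks one, which is why the genome-build header lines in this head also carry the suffix. A minimal Python sketch of the same idea that leaves comment and header lines untouched might look like the following. The function name `add_missing_transcript_id` is hypothetical and not part of ARTDeco, and whether ARTDeco is sensitive to the modified header lines is not established in this thread.

```python
def add_missing_transcript_id(lines):
    """Yield GTF lines, appending an empty transcript_id where one is absent.

    Blank lines and comment/header lines (starting with '#') are passed
    through unchanged, unlike the awk one-liner, which modifies them too.
    """
    for line in lines:
        stripped = line.rstrip("\n")
        if not stripped or stripped.startswith("#") or "transcript_id" in stripped:
            yield stripped
        else:
            yield stripped + ' transcript_id "";'


if __name__ == "__main__":
    # Shortened sample records based on the head shown in this comment.
    sample = [
        "#!genome-build ARS-UCD1.2",
        '1\tensembl\tgene\t339070\t350389\t.\t-\t.\tgene_id "ENSBTAG00000006648";',
        '1\tensembl\ttranscript\t339070\t350389\t.\t-\t.\tgene_id "ENSBTAG00000006648"; transcript_id "ENSBTAT00000008737";',
    ]
    for out in add_missing_transcript_id(sample):
        print(out)
```

Only the gene record in the sample gains the suffix; the header and the transcript record are printed as they came in.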

sjroth commented 2 months ago

Hi,

Can you please provide the actual file or a link to where it was downloaded from? I suspect this is either a GTF conversion issue or a software version issue.

Best, Sam

mrb20045 commented 2 months ago

The original GTF file was obtained from the ENSEMBL database. The head of the original GTF file is attached. Could you please send me an example of a correct GTF file?

head.txt https://github.com/sjroth/ARTDeco/files/15197414/head.txt

sjroth commented 2 months ago

I am asking for either a hyperlink to the file in the ENSEMBL database or the WHOLE file. I cannot help you if you don’t help with this request.

mrb20045 commented 2 months ago

Sure. https://ftp.ensembl.org/pub/release-110/gtf/bos_taurus/Bos_taurus.ARS-UCD1.2.110.gtf.gz

sjroth commented 2 months ago

Thank you! I will look into this. I'm in the process of refactoring ARTDeco, so expect a new version soon.

mrb20045 commented 1 month ago

Thanks.

I found that the error is related to the "-intergenic-max-len" option: I get the error whenever I add it. So it is not associated with the GTF file. For now, it is running without that option, using the command below:

ARTDeco -mode preprocess -layout PE -stranded False -skip-bam-summary \
  -gtf-file modified_genes.gtf \
  -chrom-sizes-file ../0_GTF/genome.chrom.sizes -bam-files-dir $4 \
  -cpu 60 -meta-file $5 -comparisons-file $6 \
  -read-in-dist 1 -readthrough-dist 5 -intergenic-min-len 100

sjroth commented 1 month ago

My suspicion is that this is either a GTF error or a versioning error as I can see that you are using Python 3.9 rather than Python 3.6. If it is a versioning error, you need to change the versions of your software packages to exactly match those on the README or wait for my refactor to be completed (likely a few weeks to a month if I were to estimate).

On Sat, May 4, 2024 at 7:25 AM Mohammad Reza Bakhtiarizadeh wrote:

Thanks.

Also, I checked the UCSC GTF file and got the same error.

mrb20045 commented 1 month ago

Please take a look at my previous comment, which I have updated. I was able to run the tool now.

sjroth commented 1 month ago

Ah, I missed that comment. What was the original command that caused the error? I want to reproduce it for the refactor.

mrb20045 commented 1 month ago

The error occurs when I add the "-intergenic-max-len 15" option. I looked into the error "ValueError: Cannot set a DataFrame with multiple columns to the single column Max Len" and traced it to that option.

ARTDeco -mode preprocess -layout PE -stranded False -skip-bam-summary -gtf-file modified_genes.gtf -chrom-sizes-file ../0_GTF/genome.chrom.sizes -bam-files-dir $4 -cpu 60 -meta-file $5 -comparisons-file $6 -read-in-dist 1 -readthrough-dist 5 -intergenic-min-len 100 -intergenic-max-len 15

sjroth commented 1 month ago

Ah. The error is that you set the maximum length shorter than the minimum length. In the future, please include the command you used.

mrb20045 commented 1 month ago

OK, so it should be in bp? Does that mean -read-in-dist and -readthrough-dist have to be 1000 and 5000, respectively?

sjroth commented 1 month ago

You are misunderstanding. You set intergenic-max-len lower than intergenic-min-len. You cannot mathematically set a maximum length lower than a minimum length. Additionally, as specified in the documentation, all distances are in bp. Please be sure to fully read the README.
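The constraint described here can be checked before launching a run. The following is a minimal sketch only: `check_intergenic_lengths` is a hypothetical helper, not part of ARTDeco, and it assumes that omitting the maximum means no upper bound, which matches the run in this thread that succeeded without -intergenic-max-len.

```python
def check_intergenic_lengths(min_len, max_len=None):
    """Raise ValueError if the intergenic length bounds are inconsistent.

    Both values are in bp. max_len=None means no maximum was supplied.
    """
    if min_len < 0:
        raise ValueError(f"intergenic-min-len must be non-negative, got {min_len}")
    if max_len is not None and max_len < min_len:
        raise ValueError(
            f"intergenic-max-len ({max_len} bp) is smaller than "
            f"intergenic-min-len ({min_len} bp); no region can satisfy both"
        )


if __name__ == "__main__":
    check_intergenic_lengths(100)          # only a minimum given: accepted
    try:
        check_intergenic_lengths(100, 15)  # the values used in this issue
    except ValueError as err:
        print(err)
```

A check like this fails early with a message that names both options, instead of surfacing later as a pandas ValueError.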

mrb20045 commented 1 month ago

Thanks.