sjroth / ARTDeco

MIT License
Working with TCGA data: Problems in formatting? #28

Open faleevz opened 3 weeks ago

faleevz commented 3 weeks ago

Hi Sam, it's me again.

I tried running the program again with TCGA samples, but unfortunately ran straight into a new error.

When running the preprocessing, I noticed that the tag directories are created differently for the TCGA samples than for my old samples. The TCGA samples have files named like Chr1.tag, whilst the old samples have files named like 1.tag. There also seem to be a lot of decoy tags generated. Apart from the first file that gets processed, none of the other files get all of their tags generated (only Chr1 and 2, for example).

Downstream from this, readthrough mode yields empty results? All the files get generated but all the values are either 0 or NA.

I have attached an example of the corrected_exp.txt below.

Any idea what may be causing this? Thank you!

Attachment: readthrough.raw.txt
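Given the Chr1 vs. 1 difference in the tag files, one thing that may be worth ruling out is a chromosome-naming mismatch between the annotation and the genome files. The following is only a minimal sketch, assuming a plain-text GTF and a plain-text chrom.sizes file; the function names are hypothetical helpers and not part of ARTDeco, and it does not read the BAM header.

```python
def first_column_names(path):
    """Collect the distinct values in column 1 of a whitespace/tab-separated
    file, skipping blank lines and '#' comment lines (e.g. GTF headers)."""
    names = set()
    with open(path) as handle:
        for line in handle:
            if not line.strip() or line.startswith("#"):
                continue
            names.add(line.split(None, 1)[0])
    return names


def compare_chrom_names(gtf_path, chrom_sizes_path):
    """Return (shared, only_in_gtf, only_in_sizes) as sorted lists.

    Hypothetical helper: an empty `shared` list would suggest the two files
    use different naming schemes (for example '1' vs. 'chr1')."""
    gtf = first_column_names(gtf_path)
    sizes = first_column_names(chrom_sizes_path)
    return sorted(gtf & sizes), sorted(gtf - sizes), sorted(sizes - gtf)
```

Calling `compare_chrom_names` on the GTF and chrom.sizes files passed to ARTDeco would show whether the two share any chromosome names at all. This is only a guess at a possible cause, not a confirmed diagnosis.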

(artdeco) [mfaleeva@stpr-expanded01 input_files]$ ARTDeco -home-dir /dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/artdeco_results -bam-files-dir /dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/bam_files -gtf-file /dawson_genomics/Projects/SF3B3_runthrough/input_files/newgenes.gtf -cpu 4 -chrom-sizes-file /dawson_genomics/Projects/SF3B3_runthrough/input_files/genome.chrom.sizes -layout PE -stranded False
No valid run mode specified... Will generate all files...
Loading ARTDeco file structure...
ARTDeco will generate the following files:
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/artdeco_results/quantification/max_isoform.txt
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/artdeco_results/preprocess_files/gene_to_transcript.txt
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/artdeco_results/dogs/5852bf03-0a4a-491a-ac82-e452552c526a.rna_seq.genomic.gdc_realn.dogs.fpkm.txt
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/artdeco_results/quantification/readthrough.raw.txt
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/artdeco_results/dogs/all_dogs.bed
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/artdeco_results/preprocess_files/genes_condensed.bed
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/artdeco_results/readthrough/read_in.txt
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/artdeco_results/preprocess_files/cdc83302-e4d3-4b0d-b75d-a38edc5b5561.rna_seq.genomic.gdc_realn
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/artdeco_results/readthrough/corrected_exp.txt
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/artdeco_results/preprocess_files/read_in.bed
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/artdeco_results/preprocess_files/1b48c9e1-8bb4-4291-982d-02e9aba18c38.rna_seq.genomic.gdc_realn
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/artdeco_results/dogs/all_dogs.fpkm.txt
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/artdeco_results/quantification
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/artdeco_results/preprocess_files/readthrough.bed
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/artdeco_results/dogs/cdc83302-e4d3-4b0d-b75d-a38edc5b5561.rna_seq.genomic.gdc_realn.dogs.raw.txt
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/artdeco_results/dogs/1b48c9e1-8bb4-4291-982d-02e9aba18c38.rna_seq.genomic.gdc_realn.dogs.raw.txt
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/artdeco_results/dogs/all_dogs.raw.txt
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/artdeco_results/dogs/1b48c9e1-8bb4-4291-982d-02e9aba18c38.rna_seq.genomic.gdc_realn.dogs.bed
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/artdeco_results/readthrough/read_in_assignments.txt
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/artdeco_results/dogs/5852bf03-0a4a-491a-ac82-e452552c526a.rna_seq.genomic.gdc_realn.dogs.raw.txt
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/artdeco_results/dogs/5852bf03-0a4a-491a-ac82-e452552c526a.rna_seq.genomic.gdc_realn.dogs.bed
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/artdeco_results/quantification/gene.exp.raw.txt
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/artdeco_results/preprocess_files/genes.full.bed
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/artdeco_results/dogs/1b48c9e1-8bb4-4291-982d-02e9aba18c38.rna_seq.genomic.gdc_realn.dogs.fpkm.txt
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/artdeco_results/dogs
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/artdeco_results/preprocess_files/5852bf03-0a4a-491a-ac82-e452552c526a.rna_seq.genomic.gdc_realn
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/artdeco_results/dogs/cdc83302-e4d3-4b0d-b75d-a38edc5b5561.rna_seq.genomic.gdc_realn.dogs.fpkm.txt
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/artdeco_results/quantification/read_in.raw.txt
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/artdeco_results/dogs/cdc83302-e4d3-4b0d-b75d-a38edc5b5561.rna_seq.genomic.gdc_realn.dogs.bed
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/artdeco_results/preprocess_files/gene_types.txt
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/artdeco_results/quantification/gene.exp.fpkm.txt
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/artdeco_results/readthrough/readthrough.txt
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/artdeco_results/readthrough
GTF file needed... Checking...
GTF file exists...
BAM file format needed... Checking... Will infer if not user-specified.
BAM files specified as paired-end...
BAM files specified as unstranded...
No strand orientation specified... Data is unstranded... No need to infer orientation...
Summarizing BAM file stats...
3 Experiments
Files are Paired-End, Unstranded
                                                                                                                                           Experiment  Total Reads  Mapped Reads
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/bam_files/cdc83302-e4d3-4b0d-b75d-a38edc5b5561.rna_seq.genomic.gdc_realn.bam    201527601     174019350
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/bam_files/5852bf03-0a4a-491a-ac82-e452552c526a.rna_seq.genomic.gdc_realn.bam    222771993     188663278
/dawson_genomics/Projects/SF3B3_runthrough/GDC/gdc_01/genomic_files/TCGA/bam_files/1b48c9e1-8bb4-4291-982d-02e9aba18c38.rna_seq.genomic.gdc_realn.bam    230670349     206048000
Convert GTF to BED...
Warning: If your Wiggle data is a significant portion of available system memory, use the --max-mem and --sort-tmpdir options, or use --do-not-sort to disable post-conversion sorting. See --help for more information.
Generating condensed genes bed...
/home/mfaleeva/.conda/envs/artdeco/lib/python3.10/site-packages/ARTDeco-0.4-py3.10.egg/ARTDeco/preprocess.py:59: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
Generating read-in region BED file...
/home/mfaleeva/.conda/envs/artdeco/lib/python3.10/site-packages/ARTDeco-0.4-py3.10.egg/ARTDeco/preprocess.py:349: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
Generating readthrough region BED file...
Creating tag directories...
Creating quantification directory...
Generating gene expression files...
Getting maximum isoform...
Generating read-in expression file...
Generating readthrough expression file...
Creating readthrough directory...
Generate read-in vs. expression file...
/home/mfaleeva/.conda/envs/artdeco/lib/python3.10/site-packages/ARTDeco-0.4-py3.10.egg/ARTDeco/readthrough.py:113: FutureWarning: The series.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
Correcting gene expression using read-in information...
Generate readthrough vs. expression file...
/home/mfaleeva/.conda/envs/artdeco/lib/python3.10/site-packages/ARTDeco-0.4-py3.10.egg/ARTDeco/readthrough.py:113: FutureWarning: The series.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
Read-in genes assigned with read-in level threshold is -1 and read-in FPKM threshold is 0.25...
Using all genes...
Summarizing readthrough output...
Using all genes...
Read-In Summary
       cdc83302_e4d3_4b0d_b75d_a38edc5b5561.rna_seq.genomic.gdc_realn log2Ratio Read-In vs. Gene  5852bf03_0a4a_491a_ac82_e452552c526a.rna_seq.genomic.gdc_realn log2Ratio Read-In vs. Gene  1b48c9e1_8bb4_4291_982d_02e9aba18c38.rna_seq.genomic.gdc_realn log2Ratio Read-In vs. Gene
count                                                                                     1000.0                                                                                     1000.0                                                                                     1000.0
mean                                                                                         0.0                                                                                        0.0                                                                                        0.0
std                                                                                          0.0                                                                                        0.0                                                                                        0.0
min                                                                                          0.0                                                                                        0.0                                                                                        0.0
25%                                                                                          0.0                                                                                        0.0                                                                                        0.0
50%                                                                                          0.0                                                                                        0.0                                                                                        0.0
75%                                                                                          0.0                                                                                        0.0                                                                                        0.0
max                                                                                          0.0                                                                                        0.0                                                                                        0.0
Readthrough Summary
       cdc83302_e4d3_4b0d_b75d_a38edc5b5561.rna_seq.genomic.gdc_realn log2Ratio Readthrough vs. Gene  5852bf03_0a4a_491a_ac82_e452552c526a.rna_seq.genomic.gdc_realn log2Ratio Readthrough vs. Gene  1b48c9e1_8bb4_4291_982d_02e9aba18c38.rna_seq.genomic.gdc_realn log2Ratio Readthrough vs. Gene
count                                                                                         1000.0                                                                                         1000.0                                                                                         1000.0
mean                                                                                             0.0                                                                                            0.0                                                                                            0.0
std                                                                                              0.0                                                                                            0.0                                                                                            0.0
min                                                                                              0.0                                                                                            0.0                                                                                            0.0
25%                                                                                              0.0                                                                                            0.0                                                                                            0.0
50%                                                                                              0.0                                                                                            0.0                                                                                            0.0
75%                                                                                              0.0                                                                                            0.0                                                                                            0.0
max                                                                                              0.0                                                                                            0.0                                                                                            0.0
Read-In Assignments for each experiment for threshold of -1 and FPKM >= 0.25
Empty DataFrame
Columns: [cdc83302_e4d3_4b0d_b75d_a38edc5b5561.rna_seq.genomic.gdc_realn Assignment, 5852bf03_0a4a_491a_ac82_e452552c526a.rna_seq.genomic.gdc_realn Assignment, 1b48c9e1_8bb4_4291_982d_02e9aba18c38.rna_seq.genomic.gdc_realn Assignment]
Index: []
Creating DoG output directory...
Finding DoGs...
Get genes with potential DoGs with minimum length of 4000 bp, a minimum coverage of 0.15 FPKM, and screening window of 500 bp...
Generate initial screening BED file for DoGs with minimum length 4000 bp and window size 500 bp...
Initial screening coverage for DoGs with minimum length of 4000 bp...
Generate screening BED file for pre-screened DoGs...
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/mfaleeva/.conda/envs/artdeco/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/mfaleeva/.conda/envs/artdeco/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/home/mfaleeva/.conda/envs/artdeco/lib/python3.10/site-packages/ARTDeco-0.4-py3.10.egg/ARTDeco/get_dogs.py", line 98, in get_all_intervals
    downstream_stop_dict = downstream_stop_df.set_index('Name').T.to_dict('list')
  File "/home/mfaleeva/.conda/envs/artdeco/lib/python3.10/site-packages/pandas/util/_decorators.py", line 331, in wrapper
    return func(*args, **kwargs)
  File "/home/mfaleeva/.conda/envs/artdeco/lib/python3.10/site-packages/pandas/core/frame.py", line 6012, in set_index
    raise KeyError(f"None of {missing} are in the columns")
KeyError: "None of ['Name'] are in the columns"
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/mfaleeva/.conda/envs/artdeco/bin/ARTDeco", line 33, in <module>
    sys.exit(load_entry_point('ARTDeco==0.4', 'console_scripts', 'ARTDeco')())
  File "/home/mfaleeva/.conda/envs/artdeco/lib/python3.10/site-packages/ARTDeco-0.4-py3.10.egg/ARTDeco/main.py", line 595, in main
  File "/home/mfaleeva/.conda/envs/artdeco/lib/python3.10/site-packages/ARTDeco-0.4-py3.10.egg/ARTDeco/get_dogs.py", line 315, in generate_full_screening_bed
  File "/home/mfaleeva/.conda/envs/artdeco/lib/python3.10/multiprocessing/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/mfaleeva/.conda/envs/artdeco/lib/python3.10/multiprocessing/pool.py", line 774, in get
    raise self._value
KeyError: "None of ['Name'] are in the columns"

sjroth commented 3 weeks ago

This is a pandas version issue. What version are you using?

On Tue, Jun 4, 2024 at 1:11 AM faleevz @.***> wrote:

Hi Sam, its me again here. Tried running the program again with TCGA samples, but unfortunately immediately ran into this new error:

ARTDeco -home-dir ARTDECO_DIR -bam-files-dir BAM_FILES_DIR -gtf-file GTF_FILE -cpu 10 -chrom-sizes-file CHROM_SIZES_FILE -layout PE -stranded False
No valid run mode specified... Will generate all files...
Loading ARTDeco file structure...
Meta file properly formatted...
Generating reformatted meta...
Traceback (most recent call last):
  File "/home/mfaleeva/.conda/envs/artdeco/bin/ARTDeco", line 33, in <module>
    sys.exit(load_entry_point('ARTDeco==0.4', 'console_scripts', 'ARTDeco')())
  File "/home/mfaleeva/.conda/envs/artdeco/lib/python3.10/site-packages/ARTDeco-0.4-py3.10.egg/ARTDeco/main.py", line 153, in main
  File "/home/mfaleeva/.conda/envs/artdeco/lib/python3.10/site-packages/ARTDeco-0.4-py3.10.egg/ARTDeco/DESeq2.py", line 22, in reformat_meta
  File "/home/mfaleeva/.conda/envs/artdeco/lib/python3.10/site-packages/pandas/core/generic.py", line 5902, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'Group'. Did you mean: 'drop'?

Any advice on how to proceed? Thank you!

faleevz commented 3 weeks ago

pandas version 1.5.3.

Also, it works fine when I run it on my previous samples (I have not changed the artdeco environment).
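For anyone else needing to report versions from inside the same conda environment, a small sketch using only the standard library is below. The helper name is hypothetical, and the package names in the loop are just the ones mentioned in this thread.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it
    is not installed in the current environment."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Print the versions relevant to this issue (None means not installed).
for name in ("pandas", "ARTDeco"):
    print(name, installed_version(name))
```

Running this with the environment's own Python interpreter reports the versions that ARTDeco actually sees, which can differ from those of a system-wide install.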

sjroth commented 3 weeks ago

Sorry. You have changed the post/content since I last saw it. Can you verify that it created a read_in.bed and a readthrough.bed?

faleevz commented 3 weeks ago

Yes. Both files are generated. Both look OK judging by the header and by visualisation in IGV.

The only difference I can spot between them and the previous read_in and readthrough files that worked is that these are smaller (800 KB), whilst the others were 1.9 MB.

[Two screenshots attached: 2024-06-05, 6:27:42 am and 6:28:27 am]
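To see where the size difference between the old and new BED files comes from, one option is to count records per chromosome in each file and compare. This is a rough sketch with hypothetical helper names, assuming standard BED files where column 1 is the chromosome name; it is not something ARTDeco provides.

```python
from collections import Counter


def bed_records_per_chrom(path):
    """Count BED records per chromosome name (column 1), skipping blank
    lines and 'track' / 'browser' / '#' header lines."""
    counts = Counter()
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if not fields or line.startswith("#"):
                continue
            if fields[0] in ("track", "browser"):
                continue
            counts[fields[0]] += 1
    return counts


def diff_bed_chroms(old_path, new_path):
    """Map each chromosome name to (count_in_old, count_in_new)."""
    old = bed_records_per_chrom(old_path)
    new = bed_records_per_chrom(new_path)
    return {chrom: (old.get(chrom, 0), new.get(chrom, 0))
            for chrom in sorted(set(old) | set(new))}
```

Passing the working read_in.bed as `old_path` and the TCGA one as `new_path` would show whether whole chromosomes are missing from the smaller file or whether the names simply differ between the two runs.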