New installation not reading hg data

davidaray commented 2 years ago

I just finished installing the package following the instructions on the main GitHub page and tried to run the human data as a test.

The software attempts to run but hits an error nearly immediately. The command used and errors are below.

I have some experience with Python, and this looks like a read-in error, but I don't understand why the program would hit it with the data provided. It seems to be misinterpreting the column headers as gene information.

I tried to correct the problem by re-importing the provided Cleaned_Chr7_13_Human_Genes.tsv as a pandas dataframe and re-saving it, but with no luck.

Would you have any ideas how to fix this problem?

David

. ~/conda/etc/profile.d/conda.sh
conda activate tedensity
cd /lustre/work/daray/software/TE_Density
source tedensity-virt/bin/activate

GENOME=hg
RUNTYPE=${GENOME}_tedensity
DIR=/lustre/scratch/daray/tedensity/$RUNTYPE
TEDATA=/lustre/work/daray/software/TE_Density/TE_Density_Filtered_Gene_and_TE_Annotations/Cleaned_Chr7_13_Human_TEs.tsv
GENEDATA=Cleaned_Chr7_13_Human_Genes.tsv
OUTPUT_DIR=${GENOME}_tedensity
PROGRAMDIR=/lustre/work/daray/software/TE_Density

python $PROGRAMDIR/process_genome.py \
$TEDATA \
$GENEDATA \
$GENOME \
-c $PROGRAMDIR/config/production_run_config.ini \
-n 36 \
-o $DIR

Output with errors:

2022-07-16 19:39:52 cpu-26-7 __main__[83187] INFO preprocessing...
2022-07-16 19:39:52 cpu-26-7 PreProcessor[83187] INFO Reading pre-filtered gene annotation file /lustre/work/daray/software/TE_Density/TE_Density_Filtered_Gene_and_TE_Annotations/Cleaned_Chr7_13_Human_TEs.tsv
Traceback (most recent call last):
  File "/lustre/work/daray/software/TE_Density/transposon/import_filtered_genes.py", line 18, in import_filtered_genes
    gene_data = pd.read_csv(
  File "/lustre/work/daray/software/TE_Density/tedensity-virt/lib/python3.8/site-packages/pandas/io/parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/lustre/work/daray/software/TE_Density/tedensity-virt/lib/python3.8/site-packages/pandas/io/parsers.py", line 454, in _read
    data = parser.read(nrows)
  File "/lustre/work/daray/software/TE_Density/tedensity-virt/lib/python3.8/site-packages/pandas/io/parsers.py", line 1133, in read
    ret = self._engine.read(nrows)
  File "/lustre/work/daray/software/TE_Density/tedensity-virt/lib/python3.8/site-packages/pandas/io/parsers.py", line 2113, in read
    index, names = self._make_index(data, alldata, names)
  File "/lustre/work/daray/software/TE_Density/tedensity-virt/lib/python3.8/site-packages/pandas/io/parsers.py", line 1556, in _make_index
    index = self._get_simple_index(alldata, columns)
  File "/lustre/work/daray/software/TE_Density/tedensity-virt/lib/python3.8/site-packages/pandas/io/parsers.py", line 1588, in _get_simple_index
    i = ix(idx)
  File "/lustre/work/daray/software/TE_Density/tedensity-virt/lib/python3.8/site-packages/pandas/io/parsers.py", line 1583, in ix
    raise ValueError(f"Index {col} invalid")
ValueError: Index Gene_Name invalid

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/lustre/work/daray/software/TE_Density//process_genome.py", line 276, in <module>
    preprocessor.process()
  File "/lustre/work/daray/software/TE_Density/transposon/preprocess.py", line 102, in process
    gene_frame = self._filter_genes()
  File "/lustre/work/daray/software/TE_Density/transposon/preprocess.py", line 161, in _filter_genes
    gene_data_unwrapped = verify_gene_cache(self.gene_in, self._logger)
  File "/lustre/work/daray/software/TE_Density/transposon/verify_cache.py", line 124, in verify_gene_cache
    gene_data = import_filtered_genes(genes_input_file, logger)
  File "/lustre/work/daray/software/TE_Density/transposon/import_filtered_genes.py", line 33, in import_filtered_genes
    raise ValueError(
ValueError: Error occurred while trying to read preprocessed gene
                         annotation file into a Pandas dataframe, please refer
                         to the README as to what information is expected

If it helps, I get the same error with my own data.

sjteresi commented 2 years ago

Hello,

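First of all, thank you for using my tool; I hope I can be of help! Looking at your job submission script, it appears you swapped the order of the input arguments: the gene data needs to be the first argument and the TE data the second. Your log shows the symptom, since the line "Reading pre-filtered gene annotation file" names Cleaned_Chr7_13_Human_TEs.tsv.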

It needs to be:

python $PROGRAMDIR/process_genome.py \
$GENEDATA \
$TEDATA \
$GENOME \
-c $PROGRAMDIR/config/production_run_config.ini \
-n 36 \
-o $DIR

Sincerely, Scott Teresi
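
For anyone who hits the same message, here is a minimal sketch of a pre-flight check that catches swapped arguments before the run starts. It is not part of TE_Density: the helper names read_header and check_gene_file are hypothetical, and it assumes only what the traceback suggests, namely that the gene annotation is tab-separated and has a Gene_Name column in its header row.

```python
import csv


def read_header(path, delimiter="\t"):
    """Return the column names from the first row of a delimited text file."""
    with open(path, newline="") as handle:
        # next(..., []) yields an empty header for an empty file
        return next(csv.reader(handle, delimiter=delimiter), [])


def check_gene_file(gene_path, required_column="Gene_Name"):
    """Raise ValueError if the file passed as gene data lacks the expected column.

    Assumption: the gene annotation carries a 'Gene_Name' header, inferred
    from the 'Index Gene_Name invalid' message in the traceback above.
    """
    header = read_header(gene_path)
    if required_column not in header:
        raise ValueError(
            f"{gene_path} has no '{required_column}' column (found {header}); "
            "were the gene and TE arguments swapped?"
        )
    return header
```

Calling check_gene_file on whatever path is about to be passed as the first positional argument fails fast with a message that names the likely cause, instead of surfacing as a pandas index error deep inside the preprocessing step.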

davidaray commented 2 years ago

Thanks. I'd never have noticed something so simple.

sjteresi / TE_Density

New installation not reading hg data #110