suhrig / arriba

Fast and accurate gene fusion detection from RNA-Seq data

using own fa file and gtf file error: gzip: stdin: not in gzip format #250

Open ppxinyi opened 2 months ago

ppxinyi commented 2 months ago

When I zip the Homo_sapiens.GRCh38.dna.primary_assembly.fa into a .gz file and add its link to download.sh, I get the error: gzip: stdin: not in gzip format.

If I use the .fa file directly, I get a different error: EXITING because of INPUT ERROR: the file format of the genomeFastaFile: my_genomeviral.fa is not fasta: the first character is ' ' (10), not '>'. Solution: check formatting of the fasta file. Make sure the file is uncompressed (unzipped).

I have checked the first character: head Homo_sapiens.GRCh38.dna.primary_assembly.fa

1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Do you have any suggestions for using my own FASTA and GTF files?

suhrig commented 2 months ago

When I zip the Homo_sapiens.GRCh38.dna.primary_assembly.fa

What command did you use to compress the file? You must use gzip for this, not zip.

EXITING because of INPUT ERROR: the file format of the genomeFastaFile: my_genomeviral.fa is not fasta: the first character is ' ' (10), not '>'.

Your FastA file appears to start with a line break. Remove the line break and make sure that every line containing a chromosome name begins with a >, for example:

>1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
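
Both points above can be checked from the command line. The following is a minimal sketch, assuming a small throwaway file (demo.fa, a hypothetical stand-in for the real assembly): it compresses the file with gzip, verifies the archive with gzip -t, and confirms that the decompressed stream starts with >.

```shell
# Hypothetical demo file standing in for the real assembly.
workdir=$(mktemp -d)
printf '>1 demo\nNNNN\n' > "$workdir/demo.fa"

# Compress with gzip (not zip); -c writes to stdout, so the original is kept.
gzip -c "$workdir/demo.fa" > "$workdir/demo.fa.gz"

# gzip -t exits with a non-zero status if the file is not valid gzip data.
if gzip -t "$workdir/demo.fa.gz" 2>/dev/null; then
    gz_status="valid gzip"
else
    gz_status="not gzip"
fi
echo "archive check: $gz_status"

# The decompressed stream must begin with '>'.
first_char=$(gzip -dc "$workdir/demo.fa.gz" | head -c 1)
echo "first character: $first_char"
```

Running gzip -t on a file created with zip, or on an uncompressed file, should fail in the same way the download script did.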

ppxinyi commented 2 months ago

Sorry, I failed to copy the >. I think my file already has this format, but it still shows the same error:

EXITING because of INPUT ERROR: the file format of the genomeFastaFile: my_genomeviral.fa is not fasta: the first character is ' ' (10), not '>'. Solution: check formatting of the fasta file. Make sure the file is uncompressed (unzipped).

Sep 05 00:16:05 ...... FATAL ERROR, exiting

1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

ppxinyi commented 2 months ago

I just found that the > cannot be copied.

suhrig commented 2 months ago

Your FastA file starts with a line break. You must remove the line break.

sed -i '/^$/D' my_genomeviral.fa
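
To see whether the leading line break is really there, it can help to print the first byte of the file named in the error message (my_genomeviral.fa, not the original Ensembl download). The following is a minimal sketch that uses a throwaway file with a deliberately broken first line; the file contents and the helper first_byte are hypothetical. It writes to a temporary file instead of using sed -i, only so that it behaves the same with GNU and BSD sed.

```shell
# Hypothetical stand-in for the combined FASTA, with a leading empty line.
workdir=$(mktemp -d)
fasta="$workdir/my_genomeviral.fa"
printf '\n>1 demo\nNNNN\n' > "$fasta"

# Print the first byte as a decimal number: 10 is a line feed, 62 is '>'.
first_byte() { head -c 1 "$1" | od -An -tu1 | tr -d ' \n'; }

before=$(first_byte "$fasta")
echo "first byte before cleanup: $before"

# Delete all empty lines (same effect as the sed command above).
sed '/^$/d' "$fasta" > "$fasta.tmp" && mv "$fasta.tmp" "$fasta"

after=$(first_byte "$fasta")
echo "first byte after cleanup: $after"
```

If the first byte is still 10 after the cleanup, the file was probably regenerated by the script after it was edited.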

ppxinyi commented 2 months ago

Yes, I have used this before, but it still shows the same error.

head Homo_sapiens.GRCh38.dna.primary_assembly.fa

1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

suhrig commented 2 months ago

Hm, it probably has something to do with how you added the file to the download_references.sh script. Can you attach your modified script?

ppxinyi commented 2 months ago

ASSEMBLIES[my_genome]=" https://m-ee4cea.a1bfb5.bd7c.data.globus.org/scratch/chadbren_root/chadbren99/ppxinyi/referenceGRC38/ Homo_sapiens.GRCh38.dna.primary_assembly.fa https://m-d953b2.9601a.bd7c.data.globus.org/umms-chadbren-dataden/apurvadb/10236-CB/Homo_sapiens.GRCh38.dna.primary_assembly.fa
ANNOTATIONS[my_annotation]=" https://m-ee4cea.a1bfb5.bd7c.data.globus.org/scratch/chadbren_root/chadbren99/ppxinyi/referenceGRC38/Homo_sapiens.GRCh38.105.transcript.gtf
COMBINATIONS["my_genome+my_annotation"]="my_genome+my_annotation"

I just added these links in the download_references.sh.

I also found that if I use the ASSEMBLIES entry you provide together with my own ANNOTATIONS file, I get this error: Sep 05 01:30:42 ...... FATAL ERROR, exiting

.log file:
Downloading assembly: http://ftp.ensembl.org/pub/release-112/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
Appending RefSeq viral genomes
Downloading annotation: https://m-ee4cea.a1bfb5.bd7c.data.globus.org/scratch/chadbren_root/chadbren99/ppxinyi/referenceGRC38/Homo_sapiens.GRCh38.105.transcript.gtf
STAR --runMode genomeGenerate --genomeDir STAR_index_my_genomeviral_my_annotation --genomeFastaFiles my_genomeviral.fa --sjdbGTFfile my_annotation.gtf --runThreadN 8 --sjdbOverhang 250
STAR version: 2.7.11b compiled: 2024-08-28T19:55:43-04:00 gl-login2.arc-ts.umich.edu: /gpfs/accounts/chadbren_root/chadbren99/ppxinyi/STAR-2.7.11b/source
Sep 05 01:29:52 ..... started STAR run
Sep 05 01:29:52 ... starting to generate Genome files
Sep 05 01:30:42 ..... processing annotations GTF

suhrig commented 1 month ago

Sorry for the slow response. I am currently busy. I will have more time next week.

In the meantime, have you tried building the STAR index yourself? What the script does is actually not so complicated. All it does is concatenate the RefSeq viruses to the human genome and then build a STAR index from it. Maybe it's easiest if you build the index manually?
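
The manual route could look roughly like the sketch below. It is a dry run under stated assumptions, not the script's actual code: human.fa and viral.fa are hypothetical stand-ins with dummy sequences, and the accession NC_000000 is made up. The STAR options are copied from the log posted earlier in this thread. The command is only printed, so nothing is indexed until the string is run by hand with real files.

```shell
# Hypothetical stand-ins for the human assembly and the RefSeq viral genomes.
workdir=$(mktemp -d)
printf '>1 demo human contig\nACGT\n' > "$workdir/human.fa"
printf '>NC_000000 demo virus\nTTGA\n' > "$workdir/viral.fa"

# Step 1: concatenate the viral genomes to the human genome.
combined="$workdir/my_genomeviral.fa"
cat "$workdir/human.fa" "$workdir/viral.fa" > "$combined"

# Step 2: sanity checks before indexing.
n_contigs=$(grep -c '^>' "$combined")   # number of header lines
first_char=$(head -c 1 "$combined")     # must be '>'
echo "contigs: $n_contigs, first character: $first_char"

# Step 3: assemble the STAR command (options taken from the log in this thread).
star_cmd="STAR --runMode genomeGenerate \
--genomeDir STAR_index_my_genomeviral_my_annotation \
--genomeFastaFiles $combined \
--sjdbGTFfile my_annotation.gtf \
--runThreadN 8 --sjdbOverhang 250"

# Dry run: print the command instead of executing it.
echo "$star_cmd"
```

With real inputs, the combined FASTA and an uncompressed GTF would replace the demo files, and the printed command would be executed directly.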