signalbash / how_are_we_stranded_here

Check strandedness of RNA-Seq fastq files
MIT License
115 stars 25 forks source link

Can't find any of the first 10 BED transcript_ids in fasta file #22

Open eggrandio opened 3 months ago

eggrandio commented 3 months ago

Hi,

I am getting the

Can't find any of the first 10 BED transcript_ids in fasta file

error while the transcript_ids of the bed file are clearly in the transcript fasta file.

I parsed the .gtf file with 'agat_convert_sp_gff2gtf.pl' and made a custom bash script to retain only the transcript_id as the FASTA header so there are no issues. Here is a sample of the gtf, the bed and the fasta files: GTF:

#!genome-build DOE Joint Genome Institute Ccitriodora_v2_1
#!genome-version Ccitriodora_v2_1
#!genome-date 2020-10
#!genome-build-accession GCA_014858505.1
#!genebuild-last-updated 2021-09
1   Ccitriodora_v2_1    region  1   31549113    .   .   .   gene_id "region:1"; Alias "CM026410.1"; ID "region:1";
1   JGI gene    17519   18454   .   +   .   gene_id "gene-BT93_A0001"; ID "gene:gene-BT93_A0001"; biotype "protein_coding"; logic_name "genemodel_jgi";
1   JGI mRNA    17519   18454   .   +   .   gene_id "gene-BT93_A0001"; transcript_id "rna-gnl|WGS:JABURB|Cocit.A0001.1"; ID "transcript:rna-gnl|WGS:JABURB|Cocit.A0001.1"; Parent "gene:gene-BT93_A0001"; biotype "protein_coding"; tag "Ensembl_canonical";
1   JGI exon    17519   17859   .   +   .   gene_id "gene-BT93_A0001"; transcript_id "rna-gnl|WGS:JABURB|Cocit.A0001.1"; ID "rna-gnl|WGS:JABURB|Cocit.A0001.1-E1"; Name "rna-gnl|WGS:JABURB|Cocit.A0001.1-E1"; Parent "transcript:rna-gnl|WGS:JABURB|Cocit.A0001.1"; constitutive "1"; ensembl_end_phase "2"; ensembl_phase "0"; exon_id "rna-gnl|WGS:JABURB|Cocit.A0001.1-E1"; rank "1";
1   JGI exon    17879   17949   .   +   .   gene_id "gene-BT93_A0001"; transcript_id "rna-gnl|WGS:JABURB|Cocit.A0001.1"; ID "rna-gnl|WGS:JABURB|Cocit.A0001.1-E2"; Name "rna-gnl|WGS:JABURB|Cocit.A0001.1-E2"; Parent "transcript:rna-gnl|WGS:JABURB|Cocit.A0001.1"; constitutive "1"; ensembl_end_phase "1"; ensembl_phase "2"; exon_id "rna-gnl|WGS:JABURB|Cocit.A0001.1-E2"; rank "2";
1   JGI exon    18051   18454   .   +   .   gene_id "gene-BT93_A0001"; transcript_id "rna-gnl|WGS:JABURB|Cocit.A0001.1"; ID "rna-gnl|WGS:JABURB|Cocit.A0001.1-E3"; Name "rna-gnl|WGS:JABURB|Cocit.A0001.1-E3"; Parent "transcript:rna-gnl|WGS:JABURB|Cocit.A0001.1"; constitutive "1"; ensembl_end_phase "0"; ensembl_phase "1"; exon_id "rna-gnl|WGS:JABURB|Cocit.A0001.1-E3"; rank "3";
1   JGI CDS 17519   17859   .   +   0   gene_id "gene-BT93_A0001"; transcript_id "rna-gnl|WGS:JABURB|Cocit.A0001.1"; ID "CDS:cds-KAF8041257.1"; Parent "transcript:rna-gnl|WGS:JABURB|Cocit.A0001.1"; protein_id "cds-KAF8041257.1";
1   JGI CDS 17879   17949   .   +   1   gene_id "gene-BT93_A0001"; transcript_id "rna-gnl|WGS:JABURB|Cocit.A0001.1"; ID "CDS:cds-KAF8041257.1"; Parent "transcript:rna-gnl|WGS:JABURB|Cocit.A0001.1"; protein_id "cds-KAF8041257.1";
1   JGI CDS 18051   18454   .   +   2   gene_id "gene-BT93_A0001"; transcript_id "rna-gnl|WGS:JABURB|Cocit.A0001.1"; ID "CDS:cds-KAF8041257.1"; Parent "transcript:rna-gnl|WGS:JABURB|Cocit.A0001.1"; protein_id "cds-KAF8041257.1";

BED:

1   17518   18454   rna-gnl|WGS:JABURB|Cocit.A0001.1    0   +   17518   18454   0   3   341,71,404, 0,360,532,
1   31589   32512   rna-gnl|WGS:JABURB|Cocit.A0002.1    0   +   31589   32512   0   2   415,407,    0,516,
1   46163   47204   rna-gnl|WGS:JABURB|Cocit.A0003.1    0   +   46163   47204   0   2   441,475,    0,566,
1   62155   63760   rna-gnl|WGS:JABURB|Cocit.A0004.1    0   +   62155   63760   0   2   421,407,    0,1198,
1   64048   68743   rna-gnl|WGS:JABURB|Cocit.A0005.1    0   -   64048   68743   0   5   352,116,158,74,1541,    0,1195,1836,2993,3154,
1   70379   70571   rna-gnl|WGS:JABURB|Cocit.A0006.1    0   -   70379   70571   0   1   192,    0,
1   76482   79181   rna-gnl|WGS:JABURB|Cocit.A0007.1    0   +   76482   79181   0   3   360,387,570,    0,1132,2129,
1   76482   79181   rna-gnl|WGS:JABURB|Cocit.A0007.2    0   +   76482   79181   0   3   360,309,570,    0,1132,2129,
1   79626   88276   rna-gnl|WGS:JABURB|Cocit.A0008.1    0   -   79626   88276   0   10  551,102,86,272,92,138,71,181,157,243,   0,1512,1761,2584,4213,4387,5882,6738,7027,8407,
1   90394   103967  rna-gnl|WGS:JABURB|Cocit.A0009.1    0   +   90394   103967  0   17  450,91,119,191,211,241,416,109,200,350,97,219,240,210,286,197,882,  0,603,924,2440,2718,4600,5837,7150,7535,7867,8376,8552,9331,10114,11699,12126,12691,
1   111647  117022  rna-gnl|WGS:JABURB|Cocit.A0010.1    0   +   111647  117022  0   8   264,129,855,246,24,141,48,471,  0,1487,1745,3106,3458,3742,4009,4904,

FASTA:

>rna-gnl|WGS:JABURB|Cocit.A0001.1
ATGACAGCCCTCAAGCTCAAGAAGCTCCTCCTGACCGCCATCGCGGTCGCTGGGATCGTTGTCTCTGCTCTGCCTGACACCGCCTCGGCCCAGAACTGCGGGTGTGCAGCCAACC

Maybe it is because of the symbols in the transcript names?

etwatson commented 2 weeks ago

So, there is an edit that is needed to fix this problem, especially if you installed with pip3, since it doesn't appear that the fix is in that dist. Check the thread here.

To find where your check_strandedness.py file is, run: pip3 show how_are_we_stranded_here

Then, you can make the edit there to fix the problem.

Oh, and you will probably need to fix another bug as stated here.

Your welcome @signalbash