drdariarago closed this issue 4 years ago
What version of tximeta?
1.4.2
Can you take a look at the aux_info dir and the meta_info.json file in the quant output?
The correct hash for GENCODE 32 for human should start with "4260e6".
Looks like the salmon quant hashes are incorrect.
"index_seq_hash": "4ee15dac635a67e71190aa28623e5ab8cbcde4ac9c0dad0effdace9e32a8fa5d",
"index_name_hash": "9edc28415f23c1eea85c9be24269e331797556a48d56c21158af37951b737bfc",
"index_seq_hash512": "4918efb52c9317514871a4cc41208d51492f5c71eca23431d572507fab0f2127cfaf4add2472da64d54ee79747c28d1291075528c6f5babcaf7f0dfaade00336",
"index_name_hash512": "1c892eab443b9c19feb67e86c065e8f33daa3156b3366c72349e31d15260a956fb0878b0b6ff7687ebb5fdf7c0bc2da3b538aa3310f95bfeba479c0a339053ef",
"index_decoy_seq_hash": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
"index_decoy_name_hash": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
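A quick way to compare the recorded hash against the expected GENCODE 32 value is to pull the field straight out of meta_info.json. This is only a sketch: the demo directory and fixture file below stand in for a real quant output, and the expected "4260e6" prefix is taken from the comment above.

```shell
# Sketch: check whether the salmon index sequence hash recorded in
# aux_info/meta_info.json starts with the expected GENCODE 32 prefix
# ("4260e6", per this thread). demo_quant/ is a stand-in for a real
# quant output directory.
mkdir -p demo_quant/aux_info
cat > demo_quant/aux_info/meta_info.json <<'EOF'
{ "index_seq_hash": "4ee15dac635a67e71190aa28623e5ab8cbcde4ac9c0dad0effdace9e32a8fa5d" }
EOF

hash=$(python3 -c 'import json,sys; print(json.load(open(sys.argv[1]))["index_seq_hash"])' \
       demo_quant/aux_info/meta_info.json)

case "$hash" in
  4260e6*) echo "index hash matches the expected GENCODE 32 transcriptome" ;;
  *)       echo "index hash mismatch: $hash" ;;
esac
```

With the hash from this thread, the script reports a mismatch, which is exactly the symptom being discussed here.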
@rob-p any idea? The decoys shouldn't matter, right? I'll double-check the hash in 15 min.
The same snakemake pipeline works fine for ENCODE data, so I doubt it's the decoys, but I can run a test on that. The code for decoy creation is:
# Create decoy file
echo "Creating Decoy File"
grep "^>" <(zcat {input.genome}) | cut -d " " -f 1 > {output.decoys} &&
sed -i -e 's/>//g' {output.decoys} &&
# Concatenate genome and transcriptome
echo "Concatenating genome and transcriptome"
zcat {input.transcriptome} {input.genome} > {output.gentrome} &&
Using zcat to avoid creating temporary uncompressed files.
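As a self-contained illustration of the decoy/gentrome construction above, the sketch below runs the same steps on tiny inline stand-ins for the real genome and transcriptome, and adds a guard against an empty decoys file. The file names are hypothetical, not from the pipeline itself.

```shell
# Tiny stand-ins for the real gzipped genome and transcriptome FASTA files
printf '>chr1 assembled\nACGT\n>chr2 assembled\nGGCC\n' | gzip > genome.fa.gz
printf '>tx1\nACG\n>tx2\nGCC\n' | gzip > txome.fa.gz

# Create decoy file: one genome sequence name per line, ">" stripped
zcat genome.fa.gz | grep '^>' | cut -d ' ' -f 1 | sed 's/^>//' > decoys.txt

# Concatenate transcriptome and genome (transcriptome entries must come first)
zcat txome.fa.gz genome.fa.gz > gentrome.fa

# Guard: refuse to continue if decoys.txt came out empty
if [ ! -s decoys.txt ]; then
    echo "ERROR: decoys.txt is empty; check the genome FASTA headers" >&2
    exit 1
fi
wc -l decoys.txt
```

The guard at the end catches the failure mode where the grep/sed step silently produces nothing, which would otherwise only surface much later as wrong index hashes.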
Is it possible to see the whole meta_info.json file?
Sure
{
"salmon_version": "1.1.0",
"samp_type": "bootstrap",
"opt_type": "vb",
"quant_errors": [],
"num_libraries": 1,
"library_types": [
"ISR"
],
"frag_dist_length": 1001,
"seq_bias_correct": true,
"gc_bias_correct": true,
"num_bias_bins": 4096,
"mapping_type": "mapping",
"num_valid_targets": 226608,
"num_decoy_targets": 0,
"num_eq_classes": 547237,
"serialized_eq_classes": false,
"eq_class_properties": [
"range_factorized"
],
"length_classes": [
538,
717,
1193,
2355,
109224
],
"index_seq_hash": "4ee15dac635a67e71190aa28623e5ab8cbcde4ac9c0dad0effdace9e32a8fa5d",
"index_name_hash": "9edc28415f23c1eea85c9be24269e331797556a48d56c21158af37951b737bfc",
"index_seq_hash512": "4918efb52c9317514871a4cc41208d51492f5c71eca23431d572507fab0f2127cfaf4add2472da64d54ee79747c28d1291075528c6f5babcaf7f0dfaade00336",
"index_name_hash512": "1c892eab443b9c19feb67e86c065e8f33daa3156b3366c72349e31d15260a956fb0878b0b6ff7687ebb5fdf7c0bc2da3b538aa3310f95bfeba479c0a339053ef",
"index_decoy_seq_hash": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
"index_decoy_name_hash": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
"num_bootstraps": 30,
"num_processed": 38192379,
"num_mapped": 27927337,
"num_decoy_fragments": 0,
"num_dovetail_fragments": 1146421,
"num_fragments_filtered_vm": 3794976,
"num_alignments_below_threshold_for_mapped_fragments_vm": 139706865,
"percent_mapped": 73.12280023195203,
"call": "quant",
"start_time": "Fri Jan 17 13:19:03 2020",
"end_time": "Fri Jan 17 13:52:35 2020"
}
I just ran Salmon 1.1.0 on GENCODE 32 with --gencode and I obtain "4260e6..." for the index seq hash. Maybe @alfredorago can confirm the source of the FASTA file?
Thanks! So this log shows no decoys. It looks like the decoys are being treated as valid targets, thus messing up the signatures of the regular transcriptome. Are you sure the names in the decoys.txt file passed to salmon indexing match the gentrome fasta header entries?
I see, I missed Rob's point, so my comment above is irrelevant.
This also suggests to me that the default behavior should perhaps be pickier. Right now, indexing writes warnings to the logs if there are names in the decoys.txt that it doesn't find in the input gentrome. However, I now think the better behavior might be to exit with a non-zero status and a prominent warning, since it's easy to overlook non-matches if you're not monitoring the log.
Checking the decoys file, it turns out it is empty. Not sure if this kind of situation warrants its own check, but it might prevent similar mistakes for other users.
I will try to correct the code (above) that creates decoys from the source genome. I suspect the problem was caused by blindly copy-pasting the regex for selecting transcript names.
Thanks for filing the issue @alfredorago, helps us to give better warnings/messaging, and preserve the txome matching functionality.
100% agree that this warrants its own check. It's hard to catch 100% of the cases, but I think the two that are clearly errors are:
1. There are decoy names in the decoys.txt file for which there was no matching record in the FASTA.
2. The decoys.txt file is empty.
I'll take a stab at implementing these upstream ASAP.
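In the meantime, both checks can be approximated outside salmon with a small pre-flight script. This is only a sketch of the idea, not salmon's actual index-time matching; the gentrome and decoys files below are demo stand-ins.

```shell
# Demo inputs: a gentrome with one transcript and one genome sequence,
# and a decoys.txt that deliberately names one sequence not in the FASTA.
printf '>tx1\nACG\n>chr1\nACGT\n' > gentrome.fa
printf 'chr1\nchrMISSING\n' > decoys.txt

# Check 2: decoys.txt must not be empty
if [ ! -s decoys.txt ]; then
    echo "ERROR: decoys.txt is empty" >&2
    exit 1
fi

# Check 1: every decoy name must appear as a header in the gentrome FASTA.
# Headers are truncated at the first space, matching the decoy-creation step.
grep '^>' gentrome.fa | sed 's/^>//; s/ .*//' | sort > headers.txt
missing=$(sort decoys.txt | comm -23 - headers.txt)
if [ -n "$missing" ]; then
    echo "ERROR: decoy names missing from gentrome:" >&2
    echo "$missing" >&2
fi
```

On the demo inputs this flags chrMISSING, the same class of mismatch that currently only produces a log warning during indexing.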
Is this issue now solved? Can @alfredorago check with the newest version?
I'm having the same issue with GENCODE v33 used in Salmon, with tximeta returning
couldn't find matching transcriptome, returning un-ranged SummarizedExperiment
Here's my salmon meta_info.json file:
{
"salmon_version": "1.1.0",
"samp_type": "none",
"opt_type": "vb",
"quant_errors": [],
"num_libraries": 1,
"library_types": [
"ISR"
],
"frag_dist_length": 1001,
"seq_bias_correct": false,
"gc_bias_correct": true,
"num_bias_bins": 4096,
"mapping_type": "mapping",
"num_valid_targets": 227063,
"num_decoy_targets": 0,
"num_eq_classes": 1516552,
"serialized_eq_classes": false,
"eq_class_properties": [
"range_factorized"
],
"length_classes": [
539,
718,
1197,
2359,
104301
],
"index_seq_hash": "841e05c302e1decc927bd09868588071e9dc3cc36170c767ae9ee4a28079b616",
"index_name_hash": "29358214629101a93d56e38b317fa702d3f0febbf95cd4d46b617e257068ffc8",
"index_seq_hash512": "25444653408bdb7cce7213a21cdc5de87cef1201360d3c3d49b56fd721dfce0f8fc46c53978dca8e1868e280ac9b63a7d143206e44f2b22a9c1bb4fcf93a9c01",
"index_name_hash512": "11bbd339e78c89a8c8066332f6c492644abed71e7c7c50a373306f3061b9bea98f512172ac32e977b2d4a58bfcd261d25b05b8ffa74747a5af34f994b43317ba",
"index_decoy_seq_hash": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
"index_decoy_name_hash": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
"num_bootstraps": 0,
"num_processed": 102768125,
"num_mapped": 97790293,
"num_decoy_fragments": 0,
"num_dovetail_fragments": 399731,
"num_fragments_filtered_vm": 1246054,
"num_alignments_below_threshold_for_mapped_fragments_vm": 82635196,
"percent_mapped": 95.15624908015009,
"call": "quant",
"start_time": "Wed Mar 25 07:53:20 2020",
"end_time": "Wed Mar 25 08:34:27 2020"
}
Are you using tximeta 1.4.5?
I was certain it was, since I had installed it from Bioconductor, but it turns out that actually installed v1.2.2, sorry. I'll figure out how to get 1.4.5 and try again. Thanks!
Sorry, it seems to work fine with 1.4.5!
I am trying to import Salmon (v1.1.0) data via tximeta. When I use the ENCODE files for mapping, tximeta runs smoothly but removes a few thousand transcripts due to the usual mismatch between FASTA and GTF files. When I map to GENCODE v32 data, however, tximeta does not seem to recognize the metadata and gives me an error.
I am using the latest version of the package, and I have selected the --gencode option for Salmon mapping. For reference, I am using this transcriptome: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.transcripts.fa.gz and this genome annotation: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.annotation.gtf.gz
Any idea why this might be happening? Thanks for the cool package! Alfredo