GENCODE transcriptome hashes not recognized

drdariarago commented 4 years ago

I am trying to import Salmon (v1.1.0) data via tximeta. When I use the Encode files for mapping, tximeta runs smoothly but removes a few thousand transcripts due to the usual mismatch between fasta and gtf files. When I map to GENCODEv32 data however, tximeta does not seem to recognize the metadata and gives me error

couldn't find matching transcriptome, returning non-ranged SummarizedExperiment
Error in missingMetadata(object, summarize = TRUE) : 
  use of this function requires transcriptome metadata which is missing.
  use a linkedTxome to provide the missing metadata and rerun tximeta()
  or use tx2gene, txOut=FALSE (and skipMeta=TRUE if Salmon)
Calls: summarizeToGene -> summarizeToGene -> .local -> missingMetadata

I am using the last version of the package, and I have selected the --gencode option for Salmon mapping.

For reference, I am using this transcriptome: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.transcripts.fa.gz and this genome annotation: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.annotation.gtf.gz

Any idea why this might be happening? Thanks for the cool package Alfredo

mikelove commented 4 years ago

What version of tximeta?

drdariarago commented 4 years ago

1.4.2

mikelove commented 4 years ago

Can you take a look into the aux_info dir and meta_info.json file for the quant output?

The correct hash for GENCODE 32 for human should start with "4260e6".

drdariarago commented 4 years ago

Looks like the salmon quant hashes are incorrect.

    "index_seq_hash": "4ee15dac635a67e71190aa28623e5ab8cbcde4ac9c0dad0effdace9e32a8fa5d",
    "index_name_hash": "9edc28415f23c1eea85c9be24269e331797556a48d56c21158af37951b737bfc",
    "index_seq_hash512": "4918efb52c9317514871a4cc41208d51492f5c71eca23431d572507fab0f2127cfaf4add2472da64d54ee79747c28d1291075528c6f5babcaf7f0dfaade00336",
    "index_name_hash512": "1c892eab443b9c19feb67e86c065e8f33daa3156b3366c72349e31d15260a956fb0878b0b6ff7687ebb5fdf7c0bc2da3b538aa3310f95bfeba479c0a339053ef",
    "index_decoy_seq_hash": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
    "index_decoy_name_hash": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",

mikelove commented 4 years ago

@rob-p any idea? The decoy shouldnt matter right? I’ll double check the hash in 15 min.

drdariarago commented 4 years ago

The same snakemake pipeline works fine for Encode data so I doubt it is the decoy but can run a test on that. The code for decoy creation is:

      # Crate decoy file
      echo "Creating Decoy File"
      grep "^>" <(zcat {input.genome}) | cut -d " " -f 1 > {output.decoys} &&
      sed -i -e 's/>//g' {output.decoys} &&

      # Concatenate genome and transcriptome
      echo "Concatenating genome and transcriptome"
      zcat {input.transcriptome} {input.genome} > {output.gentrome} &&

Using zcat to avoid creating temporary uncompressed files.

rob-p commented 4 years ago

Is it possible to see the whole meta_info.json file?

drdariarago commented 4 years ago

Sure

{
    "salmon_version": "1.1.0",
    "samp_type": "bootstrap",
    "opt_type": "vb",
    "quant_errors": [],
    "num_libraries": 1,
    "library_types": [
        "ISR"
    ],
    "frag_dist_length": 1001,
    "seq_bias_correct": true,
    "gc_bias_correct": true,
    "num_bias_bins": 4096,
    "mapping_type": "mapping",
    "num_valid_targets": 226608,
    "num_decoy_targets": 0,
    "num_eq_classes": 547237,
    "serialized_eq_classes": false,
    "eq_class_properties": [
        "range_factorized"
    ],
    "length_classes": [
        538,
        717,
        1193,
        2355,
        109224
    ],
    "index_seq_hash": "4ee15dac635a67e71190aa28623e5ab8cbcde4ac9c0dad0effdace9e32a8fa5d",
    "index_name_hash": "9edc28415f23c1eea85c9be24269e331797556a48d56c21158af37951b737bfc",
    "index_seq_hash512": "4918efb52c9317514871a4cc41208d51492f5c71eca23431d572507fab0f2127cfaf4add2472da64d54ee79747c28d1291075528c6f5babcaf7f0dfaade00336",
    "index_name_hash512": "1c892eab443b9c19feb67e86c065e8f33daa3156b3366c72349e31d15260a956fb0878b0b6ff7687ebb5fdf7c0bc2da3b538aa3310f95bfeba479c0a339053ef",
    "index_decoy_seq_hash": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
    "index_decoy_name_hash": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
    "num_bootstraps": 30,
    "num_processed": 38192379,
    "num_mapped": 27927337,
    "num_decoy_fragments": 0,
    "num_dovetail_fragments": 1146421,
    "num_fragments_filtered_vm": 3794976,
    "num_alignments_below_threshold_for_mapped_fragments_vm": 139706865,
    "percent_mapped": 73.12280023195203,
    "call": "quant",
    "start_time": "Fri Jan 17 13:19:03 2020",
    "end_time": "Fri Jan 17 13:52:35 2020"
}

mikelove commented 4 years ago

I just ran Salmon 1.1.0 on GENCODE 32 with --gencode and I obtain "4260e6..." for the index seq hash. Maybe can @alfredorago confirm the source of the FASTA file?

rob-p commented 4 years ago

Thanks! So this log shows no decoys. It looks like the decoys are being treated as valid targets, thus messing up the signatures of the regular transcriptome. Are you sure the names in the decoys.txt file passed to salmon indexing match the gentrome fasta header entries?

mikelove commented 4 years ago

I see, I missed Rob's point, so my comment above is irrelevant.

rob-p commented 4 years ago

This also suggests to me that, perhaps, the default behavior should be more picky. Right now, the indexing writes warnings to the logs if there are names in the decoys.txt that it doesn't find in the input gentrome. However, now I'm thinking the best behavior might be to exit with a non-zero exit status and write a big warning, as it's easy to overlook non-matches if you are not monitoring the log.

drdariarago commented 4 years ago

Checking the decoys file, it is empty. Not sure if this kind of situation warrants its own check but it might prevent similar mistakes for users.

I will try and correct the code (above) that creates decoys from the source genome. I suspect it was caused by my blind copy-pasting the regexpr for selecting transcript names.

mikelove commented 4 years ago

Thanks for filing the issue @alfredorago, helps us to give better warnings/messaging, and preserve the txome matching functionality.

rob-p commented 4 years ago

100% agree that this warrants its own check. It's hard to catch 100% of the cases, but I think the two that are clearly errors are:

There exist decoy names in the decoys.txt file for which there was no matched record in the fasta.
The decoys.txt file is empty.

I'll take a stab at implementing these upstream ASAP.

mikelove commented 4 years ago

Is this Issue now solved, can @alfredorago check the newest version?

pguilha commented 4 years ago

I'm having the same issue with GENCODEv33 used in Salmon and tximeta returning

couldn't find matching transcriptome, returning un-ranged SummarizedExperiment

here's my salmon meta_info.json file

{
    "salmon_version": "1.1.0",
    "samp_type": "none",
    "opt_type": "vb",
    "quant_errors": [],
    "num_libraries": 1,
    "library_types": [
        "ISR"
    ],
    "frag_dist_length": 1001,
    "seq_bias_correct": false,
    "gc_bias_correct": true,
    "num_bias_bins": 4096,
    "mapping_type": "mapping",
    "num_valid_targets": 227063,
    "num_decoy_targets": 0,
    "num_eq_classes": 1516552,
    "serialized_eq_classes": false,
    "eq_class_properties": [
        "range_factorized"
    ],
    "length_classes": [
        539,
        718,
        1197,
        2359,
        104301
    ],
    "index_seq_hash": "841e05c302e1decc927bd09868588071e9dc3cc36170c767ae9ee4a28079b616",
    "index_name_hash": "29358214629101a93d56e38b317fa702d3f0febbf95cd4d46b617e257068ffc8",
    "index_seq_hash512": "25444653408bdb7cce7213a21cdc5de87cef1201360d3c3d49b56fd721dfce0f8fc46c53978dca8e1868e280ac9b63a7d143206e44f2b22a9c1bb4fcf93a9c01",
    "index_name_hash512": "11bbd339e78c89a8c8066332f6c492644abed71e7c7c50a373306f3061b9bea98f512172ac32e977b2d4a58bfcd261d25b05b8ffa74747a5af34f994b43317ba",
    "index_decoy_seq_hash": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
    "index_decoy_name_hash": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
    "num_bootstraps": 0,
    "num_processed": 102768125,
    "num_mapped": 97790293,
    "num_decoy_fragments": 0,
    "num_dovetail_fragments": 399731,
    "num_fragments_filtered_vm": 1246054,
    "num_alignments_below_threshold_for_mapped_fragments_vm": 82635196,
    "percent_mapped": 95.15624908015009,
    "call": "quant",
    "start_time": "Wed Mar 25 07:53:20 2020",
    "end_time": "Wed Mar 25 08:34:27 2020"
}

mikelove commented 4 years ago

Are you using tximeta 1.4.5?

pguilha commented 4 years ago

I was certain it was as I had installed it off bioconductor but it turns out it actually downloads v1.2.2, sorry. I'll figure out how to get 1.4.5 and try again. thanks

pguilha commented 4 years ago

sorry, seems to work fine with 1.4.5!!

thelovelab / tximeta

GENCODE transcriptome hashes not recognized #24