richelbilderbeek / bbbq_article

The Bianchi Bilderbeek Bogaart Question answered
GNU General Public License v3.0
0 stars 0 forks source link

Fix netmhc2pan::run_netmhc2pan: Wrong input format #55

Closed richelbilderbeek closed 4 years ago

richelbilderbeek commented 4 years ago

Needed for #54.

From log file: run_r_script_13381509.log

Rscript create_counts_per_proteome.R human h14
haplotype: HLA-DRB1*0101
mhc_class: 2
peptide_length: 13
ic50_prediction_tool: netmhc2pan
Error in netmhc2pan::run_netmhc2pan(fasta_filename = temp_fasta_filename,  : 
  ERROR: Wrong input format.
Calls: <Anonymous> -> <Anonymous> -> <Anonymous> -> <Anonymous>
richelbilderbeek commented 4 years ago

Reinstalled netmhc2pan, then try:

p230198@peregrine:bbbq_1_fast sbatch run_r_script.sh create_counts_per_proteome.R human h14
Submitted batch job 13422720
richelbilderbeek commented 4 years ago

Great, the error message has already improved, from

p230198@peregrine:bbbq_1_fast cat $(ls | egrep 720)
Rscript create_counts_per_proteome.R human h14
haplotype: HLA-DRB1*0101
mhc_class: 2
peptide_length: 13
ic50_prediction_tool: netmhc2pan
Error in netmhc2pan::run_netmhc2pan(fasta_filename = temp_fasta_filename,  : 
  ERROR: Wrong input format.
Calls: <Anonymous> ... <Anonymous> -> <Anonymous> -> <Anonymous> -> <Anonymous>
Execution halted

to

p230198@peregrine:bbbq_1_fast cat $(ls | egrep 791)
Rscript create_counts_per_proteome.R human h14
haplotype: HLA-DRB1*0101
mhc_class: 2
peptide_length: 13
ic50_prediction_tool: netmhc2pan
Error in netmhc2pan::run_netmhc2pan(fasta_filename = temp_fasta_filename,  : 
  NetMHCIIpan output file no created. alleles_as_word: 'DRB1_0101,'. peptide_length: '13'. temp_xls_filename: '/home/p230198/.cache/temp_7b6579c3a6b9.xls'. temp_xls_filename: '/home/p230198/.cache/temp_7b6579c3a6b9.xls'. fasta_filename: '/home/p230198/.cache/temp_7b652a5c1b31.fasta'. utils::head(readLines(fasta_filename)): '>seq1 
GTGG'. NetMHCII error output: 'ERROR: Wrong input format.'
Calls: <Anonymous> ... <Anonymous> -> <Anonymous> -> <Anonymous> -> <Anonymous>
Execution halted
richelbilderbeek commented 4 years ago

The FASTA that fails:

p230198@peregrine:bbbq_1_fast cat /home/p230198/.cache/temp_7b652a5c1b31.fasta
>seq1
GTGG

it is remarkably short :confused:

richelbilderbeek commented 4 years ago

Added more descriptive errors in netmhc2pan. Also: bbbq checks for short sequence lengths.

Run again, after installing bbbq:

p230198@peregrine:bbbq_1_fast q
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON) 
          13427123   regular install_  p230198  R       0:46      1 pg-node210 
p230198@peregrine:bbbq_1_fast sbatch --dependency=afterok:13427123 run_r_script.sh create_counts_per_proteome.R human h14
Submitted batch job 13427149
richelbilderbeek commented 4 years ago

In #56 I found out that there is a sequence with a 'U' amino acids that caused the problem.

richelbilderbeek commented 4 years ago

Hmmm, NetMHC2pan does handle U:

test_that("NetMHC2pan does handle U", {
  if (!pureseqtmr::is_pureseqtm_installed()) return()
  if (!netmhc2pan::is_netmhc2pan_installed()) return()

  protein_sequences <- c(
    "FANTASTICALLY",
    "FANTASTICALLU"
  )
  t <- predict_counts_per_proteome(
    protein_sequences = protein_sequences,
    haplotype = get_mhc2_haplotypes()[1],
    peptide_length = 13,
    percentile = 0.05,
    ic50_prediction_tool = "netmhc2pan"
  )
  t
  expect_true(tibble::is_tibble(t))
  expect_equal(nrow(t), length(protein_sequences))
  expect_true("n_binders" %in% names(t))
  expect_true("n_binders_tmh" %in% names(t))
  expect_true("n_spots" %in% names(t))
  expect_true("n_spots_tmh" %in% names(t))
})
richelbilderbeek commented 4 years ago

This is great: it now fails fast:

p230198@peregrine:bbbq_1_fast cat $(ls | egrep 13427149)
Rscript create_counts_per_proteome.R human h14
haplotype: HLA-DRB1*0101
mhc_class: 2
peptide_length: 13
ic50_prediction_tool: netmhc2pan
Error in bbbq::check_protein_sequences_length(protein_sequences = protein_sequences,  : 
  Protein sequence shorter than epitope peptide length .
'peptide_length': 13 
sequence index: 1217 
sequence length: 4 
sequence: GTGG
Calls: <Anonymous> -> <Anonymous>
Execution halted

###############################################################################
Peregrine Cluster
Job 13427149 for user 'p230198'
Finished at: Thu Aug 27 09:23:55 CEST 2020

Job details:
============

Name                : run_r_script
User                : p230198
Partition           : regular
Nodes               : pg-node193
Cores               : 1
State               : FAILED
Submit              : 2020-08-27T09:13:45
Start               : 2020-08-27T09:23:47
End                 : 2020-08-27T09:23:55
Reserved walltime   : 10-00:00:00
Used walltime       :    00:00:08
Used CPU time       :    00:00:02 (efficiency: 37.19%)
% User (Computation): 69.21%
% System (I/O)      : 30.76%
Mem reserved        : 10G/node
Max Mem used        : 0.00  (pg-node193)
Max Disk Write      : 0.00  (pg-node193)
Max Disk Read       : 0.00  (pg-node193)

Acknowledgements:
=================

Please see this page for information about acknowledging Peregrine in your publications:

https://wiki.hpc.rug.nl/peregrine/additional_information/scientific_output

################################################################################
richelbilderbeek commented 4 years ago

Aha, there is a very small peptide in the human proteome:

p230198@peregrine:bbbq_1_fast cat human_proteins_lut.csv | egrep "GTGG$"
p1217,sp|P0DPI4|TDB01_HUMAN,GTGG
richelbilderbeek commented 4 years ago

Checking in the original file:

p230198@peregrine:bbbq_1_fast cat human.fasta | egrep "^GTGG$" -A 1 -B 1
>sp|P0DPI4|TDB01_HUMAN T cell receptor beta diversity 1 OS=Homo sapiens OX=9606 GN=TRBD1 PE=4 SV=1
GTGG
>sp|Q6N021|TET2_HUMAN Methylcytosine dioxygenase TET2 OS=Homo sapiens OX=9606 GN=TET2 PE=1 SV=3

Indeed, the human reference proteome contains a 4-mer!

richelbilderbeek commented 4 years ago

Created #58 to handle this Issue, now knowing the cause.