glygener / glygen-issues

Repository for public GlyGen tickets
GNU General Public License v3.0

Instructions for isoform mapping tool #560

Open jeet-vora opened 10 months ago

jeet-vora commented 10 months ago

Preethi can you add all the instructions about the isoform mapping tool here?

pkay47 commented 10 months ago
  1. Install JDK 11 (open source: https://openjdk.org/projects/jdk/11/).
  2. Download the jar https://ftp.ebi.ac.uk/pub/contrib/glygen/previous_releases/2022_01/glygen.jar to
  3. Prepare input 1: the protein master list of the current release, in . Sample: https://docs.google.com/spreadsheets/d/1AVXfw8dd00r70x-eWLw0GsUHNyfCOz-agxHxHwFPcdI/edit?usp=drive_link
  4. Prepare input 2: the sequence of each protein in the master list, in FASTA format, in . Sample: https://drive.google.com/file/d/1MBgTAW4InW0tvVsNeDeuc7fmhJ6lbW_F/view?usp=drive_link
  5. Prepare input 3: the list of isoform accessions, aa positions and amino acids to map, in . Sample: https://drive.google.com/file/d/1x0KZjhf24W6XR7qVb7Tv9mNX-ulFkR2o/view?usp=drive_link
  6. Run the app with the 3 input parameters and the output file name:

    java -classpath \glygen.jar uk.ac.ebi.uniprot.glygen.util.SiteMappingTool -protein_list \human_protein_masterlist.csv -glygen_fasta \human_protein_allsequences.fasta -isoform_info \input.csv -out_file \output.csv

  7. Results in \output.csv
     7.1 uniprotkb isoform_ac: first col of input.csv
     7.2 aa_pos isoform: second col of input.csv
     7.3 amino_acid isoform: third col of input.csv
     7.4 uniprotkb canonical_ac: canonical accession of isoform
     7.5 aa_pos canonical: aa position in canonical (same as col 2)
     7.6 amino_acid canonical: aa at position in canonical
     7.7 status: aa at isoform & canonical is same for given position?
     7.8 mapped_pos canonical: same as col 2 if aa's match else actual position in canonical
     7.9 aa_mapped can_pos: aa at mapped_pos canonical, same as col 3

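
For anyone scripting step 6, here is a minimal Python sketch that assembles the same command line as an argument list. The main class and the four flags are copied from the command above. The function name `build_command` and the file paths in the usage lines are only illustrative placeholders, since the directory part of the paths is not given in this thread.

```python
# Main class and flags are taken verbatim from the command in step 6.
MAIN_CLASS = "uk.ac.ebi.uniprot.glygen.util.SiteMappingTool"


def build_command(jar, protein_list, glygen_fasta, isoform_info, out_file):
    """Return the java invocation from step 6 as an argument list.

    All five arguments are file paths chosen by the caller
    (hypothetical names here, not defined by the tool).
    """
    return [
        "java", "-classpath", str(jar), MAIN_CLASS,
        "-protein_list", str(protein_list),
        "-glygen_fasta", str(glygen_fasta),
        "-isoform_info", str(isoform_info),
        "-out_file", str(out_file),
    ]


# Placeholder paths: substitute the locations of your own files.
cmd = build_command(
    "glygen.jar",
    "human_protein_masterlist.csv",
    "human_protein_allsequences.fasta",
    "input.csv",
    "output.csv",
)
print(" ".join(cmd))
```

Once Java 11 and the three input files are in place, the list can be passed to `subprocess.run(cmd, check=True)` from the standard library. Passing a list rather than a single string avoids shell quoting problems with paths that contain spaces.
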
pkay47 commented 3 months ago

SiteMappingTool.zip

Java source code.

pkay47 commented 2 weeks ago

@rykahsay Steps to run tool

output.csv has nine cols - isoform_ac, aa_pos, amino_acid, canonical_ac, aa_pos_can, aa_can, status, mapped_can_pos, aa_in_mapped_canonical_pos.
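
A rough Python sketch for checking the output, in case it helps: it reads the rows into dicts keyed by the column names listed above and picks out sites whose mapped canonical position differs from the isoform position. It makes several assumptions that the thread does not confirm: the columns appear in the listed order, the delimiter is a comma (pass `delimiter="\t"` for a .tsv), and there is no header row unless `has_header=True`. The function names and the sample rows are invented for illustration and are not real tool output.

```python
import csv
import io

# Column names as listed in the comment above; the order is assumed.
COLUMNS = [
    "isoform_ac", "aa_pos", "amino_acid", "canonical_ac", "aa_pos_can",
    "aa_can", "status", "mapped_can_pos", "aa_in_mapped_canonical_pos",
]


def read_rows(lines, delimiter=",", has_header=False):
    """Parse output lines into dicts keyed by COLUMNS."""
    reader = csv.reader(lines, delimiter=delimiter)
    if has_header:
        next(reader, None)  # skip the header row if the file has one
    rows = []
    for record in reader:
        if not record:
            continue  # ignore blank lines
        if len(record) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(record)}")
        rows.append(dict(zip(COLUMNS, record)))
    return rows


def shifted_sites(rows):
    """Rows whose mapped canonical position differs from the isoform position."""
    return [r for r in rows if r["mapped_can_pos"].strip() != r["aa_pos"].strip()]


# Invented sample rows (placeholder accessions and status values).
sample = io.StringIO(
    "X-2,10,N,X-1,10,N,s1,10,N\n"
    "X-2,25,S,X-1,25,T,s2,31,S\n"
)
for row in shifted_sites(read_rows(sample)):
    print(row["isoform_ac"], row["aa_pos"], "->", row["mapped_can_pos"])
```

The comparison uses the position columns rather than `status`, because the values that `status` can take are not spelled out here. How the tool encodes a site it cannot map is also not stated, so such rows would show up as shifted in this sketch.
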

ReneRanzinger commented 1 week ago

Only some changes needed to deal with the data format issue.

pkay47 commented 1 week ago

@rykahsay output will be in 'output.tsv'. Please check