smith-chem-wisc / MetaMorpheus

Proteomics search software with integrated calibration, PTM discovery, bottom-up, top-down and LFQ capabilities
MIT License

Question: How to specify custom set of proteins #2205

Open MurphyDavid opened 1 year ago

MurphyDavid commented 1 year ago

Hi,

A couple of questions as a new user:

I'd like to run this from the command line non-interactively. When using the Docker version, is there any way to run it without it pausing on this prompt and waiting for user input? Some equivalent of a "/Y" or "-Y" option?

In order to search Thermo .raw files, you must agree to the above terms. Do you agree to the above terms? y/n

My main query is about how to input custom protein sequences to search:

I've got a list of amino acid sequences and I am trying to use MetaMorpheus to search mass-spec data for matches. I'm trying to use the MetaMorpheusVignette-selected-examples as a guide on how to input data.

Do I understand correctly that the uniprot-cRAP-1-24-2018.xml.gz is for potential contaminants? And that the uniprot-mouse-reviewed-1-24-2018.xml.gz is for the sequences you're actually interested in searching for?

I've attempted to make my own equivalent of the "reviewed" xml file using this format but with multiple blocks:

<?xml version='1.0' encoding='UTF-8'?>
<uniprot
    xmlns="http://uniprot.org/uniprot"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://uniprot.org/uniprot http://www.uniprot.org/support/docs/uniprot.xsd">
    <entry dataset="Custom-1">
        <name>ABC.123</name>
        <protein>
            <recommendedName>
                <fullName evidence="5">ABC.123</fullName>
            </recommendedName>
        </protein>
        <sequence length="42" mass="4705">
MTPRSALPKQFRVHFGTQLGTEQSNVWFGIPSGVFFRISALV
</sequence>
    </entry>
</uniprot>

Running it against the tutorial example data, it seems to run without complaint or error, but it seems to run too fast. Even with a few hundred sequences, it blinks through each stage of the search fast enough that I don't think it's actually searching, and I don't know how to check which sequences MetaMorpheus has successfully loaded.

trishorts commented 1 year ago
  1. Yes, you can modify your installation so that it doesn't pause to ask about the Thermo license. You need to find the folder where MetaMorpheus is installed. That directory contains the settings.toml file.

You will need to edit the settings.toml in a text editor. Set UserHasAgreedToThermoRawFileReaderLicence = true
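
For a non-interactive setup (for example when building a Docker image), the same edit could be scripted instead of done by hand. The following is only a minimal sketch: the key name is the one quoted above, while the function name and the settings path are hypothetical and depend on where MetaMorpheus is installed. It only rewrites an existing line and leaves the file untouched if the key is not found.

```python
from pathlib import Path

# Key name as quoted in this thread.
KEY = "UserHasAgreedToThermoRawFileReaderLicence"


def agree_to_thermo_licence(settings_path):
    """Rewrite an existing 'KEY = ...' line to 'KEY = true'.

    Returns True if the key was found and rewritten, False otherwise.
    The file is not modified when the key is absent.
    """
    path = Path(settings_path)
    lines = path.read_text(encoding="utf-8").splitlines()
    found = False
    for i, line in enumerate(lines):
        # Compare the text left of the first '=' with the key name.
        if "=" in line and line.split("=", 1)[0].strip() == KEY:
            lines[i] = f"{KEY} = true"
            found = True
    if found:
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return found


# Hypothetical usage; the real location of settings.toml depends on the install:
# agree_to_thermo_licence("/path/to/MetaMorpheus/settings.toml")
```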


  1. Yes, the cRAP file contains contaminants. You can use that or not as you see fit.
  2. To confirm that MetaMorpheus is using your sequences, open the AutoGeneratedManuscriptProse.txt file in the Task-1SearchTask folder. This file will tell you which file was used as the sequence file and how many sequences were used. In your case, do a search with only your list of sequences and see if this document matches your expectations.
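
To know what number to expect in that file, it can help to count the entries in the custom XML independently. The sketch below uses only the Python standard library and is not part of MetaMorpheus; the function name is hypothetical, and it assumes the file uses the http://uniprot.org/uniprot namespace as in the examples in this thread. It reports each entry's name, the actual sequence length, and the length declared in the `length` attribute, so mismatches are easy to spot.

```python
import xml.etree.ElementTree as ET

# Namespace used by the UniProt-style XML files shown in this thread.
NS = {"up": "http://uniprot.org/uniprot"}


def summarize_uniprot_xml(xml_bytes):
    """Return a list of (name, actual_length, declared_length) per <entry>."""
    root = ET.fromstring(xml_bytes)
    summary = []
    for entry in root.findall("up:entry", NS):
        name = entry.findtext("up:name", default="", namespaces=NS).strip()
        seq_el = entry.find("up:sequence", NS)
        if seq_el is None:
            summary.append((name, 0, 0))
            continue
        # Remove line breaks and spaces inside the sequence text.
        sequence = "".join((seq_el.text or "").split())
        declared = int(seq_el.get("length", "0"))
        summary.append((name, len(sequence), declared))
    return summary


# Hypothetical usage on an uncompressed file:
# with open("my_proteins.xml", "rb") as handle:
#     for name, actual, declared in summarize_uniprot_xml(handle.read()):
#         print(name, actual, declared)
```

If the number of entries printed here differs from the count reported in AutoGeneratedManuscriptProse.txt, that points at the entries that were not loaded.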


Also, feel free to send me your sequence file and I will test it myself: mm_support@chem.wisc.edu

MurphyDavid commented 1 year ago

Unfortunately I've still had no luck with this. I've tried to create a minimal file with 2 entries: one a shortened entry from the original example file, and one with a custom sequence.

<?xml version='1.0' encoding='UTF-8'?>
<uniprot xmlns="http://uniprot.org/uniprot" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://uniprot.org/uniprot http://www.uniprot.org/support/docs/uniprot.xsd">

<entry dataset="Swiss-Prot">

<name>ZSWM2_MOUSE</name>
<protein>
<recommendedName>
<fullName>E3 ubiquitin-protein ligase Zswim2</fullName>
</recommendedName>
</protein>
<gene>
<name type="primary">Zswim2</name>
</gene>

<sequence length="631" mass="71793" >
MLRGGCKASEKRRHLSESLSWQQDQALSSSIYLLRQIGPTGFLLKEEEPEKGDFRVLLGN
PHECSCPTFLKRGELCKHICWVLLKKFKLPRNHESAFQLGLTEGEINDLLRGIHQVQAPQ
LRASDETAQVEEDGYLKQKDINAGDICPICQEVLLEKKLPVTFCRFGCGNNVHIKCMRIL
ANYQDTGSDSSVLRCPLCREEFAPLKVILEEFKNSNKLITISEKERLDKHLGIPCNNCNQ
LPIEGRCYKCTECVEYHLCQECFDSCCHSSHAFASREKRNQRWRSVEKRSEVMKYLNTEN
EGEAKPGCFQEKQGQFYTPKHVVKSLPLLMITKKSKLLAPGYQCRLCLKSFSFGQYTRLL
PCTHKFHRKCIDNWLLHKCNSCPIDRQVIYNPLIWKGIATDGQAHQLASSKDIACLSKQQ
EPKLFIPGTGLVLKGKRMGVLPSIPQYNSKVLTTLQNPSDNYQNITMDDLCSVKLDNSNS
RKLVFGYKISKQFPTYLKNPTTGQTPSQTFLPSLPHKNIICLTGRESPHIYEKDHIGQSQ
KTSRGYEHINYNTRKSLGSRLRQHKRSSALSSEDLNLTINLGTTKLSLSKRQNNSMGKVR
QKLGHPPRRPAYPPLQTQNAALSLIMQGIQL
</sequence>

</entry>

<entry dataset="Swiss-Prot">

<name>ZSWM2_MOUSEXX</name>
<protein>
<recommendedName>
<fullName>E3 ubiquitin-protein ligase Zswim2XX</fullName>
</recommendedName>
</protein>
<gene>
<name type="primary">Zswim2XX</name>
</gene>

<sequence length="87" mass="8872" >
MDVFMKGLSKAKEGVVAAAEKTKQGVAEAAGKTKEGVLYVGSKTKEGVVHGVATVAEKTK
EQVTNVGGAVLTDVPSCTSAQFQCAQS
</sequence>
</entry>

</uniprot>


The log shows that it seems to read in one sequence but then reverts to zero. I can't seem to get it to recognise the second sequence at all.

Edit: I've just realised I'm an idiot: "reset tasks" doesn't actually reset it completely, so the generated AutoGeneratedManuscriptProse.txt includes lines from previous runs.

Additional edit:

I just realised your email reply had gotten marked as spam. Thanks very much, being able to use FASTA format should be perfect.


Replace the info in yellow with appropriate info for each sequence 

>sp|A0PK11|CLRN2_HUMAN Clarin-2 OS=Homo sapiens OX=9606 GN=CLRN2 PE=1 SV=1
MPGWFKKAWYGLASLLSFSSFILIIVALVVPHWLSGKILCQTGVDLVNATDRELVKFIGD
IYYGLFRGCKVRQCGLGGRQSQFTIFPHLVKELNAGLHVMILLLLFLALALALVSMGFAI
LNMIQVPYRAVSGPGGICLWNVLAGGVVALAIASFVAAVKFHDLTERIANFQEKLFQFVV
VEEQYEESFWICVASASAHAANLVVVAISQIPLPEIKTKIEEATVTAEDILY
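
For a list of a few hundred sequences, records in this layout could be generated rather than typed by hand. The following is a rough sketch, not a MetaMorpheus tool: the function name, the default organism and taxon values, and the fixed `PE=1 SV=1` suffix are placeholders chosen to mirror the example header above, and whether every field is required by MetaMorpheus is not confirmed here.

```python
def fasta_record(accession, entry_name, description, sequence,
                 organism="Unknown", taxon_id=0, gene=None, width=60):
    """Build one FASTA record with a UniProt-style header.

    The header mirrors the example above:
    >sp|ACCESSION|ENTRY_NAME Description OS=... OX=... GN=... PE=1 SV=1
    """
    header = f">sp|{accession}|{entry_name} {description} OS={organism} OX={taxon_id}"
    if gene:
        header += f" GN={gene}"
    header += " PE=1 SV=1"  # placeholder values copied from the example header
    # Strip whitespace and wrap the sequence at `width` residues per line.
    seq = "".join(sequence.split()).upper()
    chunks = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header] + chunks) + "\n"


# Hypothetical usage with made-up identifiers:
# with open("custom_proteins.fasta", "w") as handle:
#     handle.write(fasta_record("CUSTOM1", "ABC123_CUSTOM", "ABC.123",
#                               "MTPRSALPKQFRVHFGTQLGTEQSNVWFGIPSGVFFRISALV"))
```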