idekerlab / cellmaps_hierarchyeval

Code for hierarchy evaluation
MIT License

Add support to run ollama models #7

Closed coleslaw481 closed 6 months ago

coleslaw481 commented 7 months ago

Add support for running ollama models in this tool. There should be a command-line flag, --ollama_models, that takes one or more comma-delimited parameters of the format <MODEL>:<PROMPT or PROMPT FILE>. The prompt should include the token {GENE_SET}, which will be replaced by the gene set. In addition, the prompt should tell the LLM to output the following two lines at the top:

Process: <name>
Confidence Score: <score>
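
A minimal sketch of how the flag value could be parsed and the token filled in. The function names are hypothetical and not taken from the repository. It assumes the first ':' separates the model from the prompt (so tagged model names such as llama2:latest would need a different rule), that inline prompts contain no commas, and that genes are joined with ", ":

```python
import os

GENE_SET_TOKEN = "{GENE_SET}"


def parse_ollama_models(flag_value):
    """Split 'MODEL:PROMPT,MODEL:PROMPT' into a list of (model, prompt) pairs."""
    pairs = []
    for entry in flag_value.split(","):
        # Assumption: the first ':' separates the model from the prompt.
        model, sep, prompt = entry.strip().partition(":")
        if not sep or not model or not prompt:
            raise ValueError(
                f"Expected <MODEL>:<PROMPT or PROMPT FILE>, got {entry!r}"
            )
        # If the prompt names an existing file, read the prompt text from it.
        if os.path.isfile(prompt):
            with open(prompt) as handle:
                prompt = handle.read()
        pairs.append((model, prompt))
    return pairs


def fill_prompt(prompt, genes):
    """Replace the {GENE_SET} token with the genes of one hierarchy node."""
    if GENE_SET_TOKEN not in prompt:
        raise ValueError(f"Prompt is missing the {GENE_SET_TOKEN} token")
    # Assumption: genes are joined with ', '; the real delimiter may differ.
    return prompt.replace(GENE_SET_TOKEN, ", ".join(genes))
```

Rejecting a prompt without the token up front avoids sending the same gene-free prompt to the model for every node.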

Example prompt already included in dev branch:

Write a critical analysis of the biological processes performed by this system of interacting
proteins.

Base your analysis on prior knowledge available in your training data.  After completing your
analysis, propose a brief and
detailed name for the most prominent biological process performed by the system.

After completing your analysis, please also assign a confidence score to the process name you
selected.  This score should follow the name in parentheses and range from 0.00 to 1.00. A
score of 0.00 indicates the lowest confidence, while 1.00 reflects the highest confidence.
This score helps gauge how accurately the chosen name represents the functions and activities
within the system of interacting proteins. When determining your score, consider the
proportion of genes in the protein system that participate in the identified biological
process. For instance, if you select "Ribosome biogenesis" as the process name but only a
few genes in the system contribute to this process, the score should be lower compared to a
scenario where a majority of the genes are involved in "Ribosome biogenesis".

Put your chosen name at the top of the analysis as 'Process: <name>'.
Put your assigned confidence score on the next line as 'Confidence Score: <score>'.

Be concise, do not use unnecessary words.
Be factual, do not editorialize.
Be specific, avoid overly general statements such as 'the proteins are involved in various
cellular processes'. Avoid listing facts about individual proteins. Instead, try to group
proteins with similar functions and discuss their interplay, synergistic or antagonistic
effects and functional integration within the system. Also avoid choosing generic process
names such as 'Cellular Signaling and Regulation'.  If you cannot identify a prominent
biological process for the proteins in the system, I want you to communicate this in your
analysis and name the process: "System of unrelated proteins". Provide a score of 0.00 for
a "System of unrelated proteins".

Here are the interacting proteins: {GENE_SET}
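
Since the prompt asks for 'Process: <name>' and 'Confidence Score: <score>' lines, the response could be parsed roughly as follows. This is only a sketch with a hypothetical function name; it assumes each label sits on its own line and returns None for anything the model left out:

```python
import re

# Match the two header lines the prompt asks the LLM to emit.
PROCESS_RE = re.compile(r"^[ \t]*Process:[ \t]*(.+?)[ \t]*$", re.MULTILINE)
CONFIDENCE_RE = re.compile(
    r"^[ \t]*Confidence Score:[ \t]*([0-9]*\.?[0-9]+)", re.MULTILINE
)


def parse_llm_response(raw):
    """Return (process_name, confidence) from the raw LLM response text."""
    process_match = PROCESS_RE.search(raw)
    confidence_match = CONFIDENCE_RE.search(raw)
    process = process_match.group(1) if process_match else None
    confidence = float(confidence_match.group(1)) if confidence_match else None
    return process, confidence
```

Searching the whole response rather than only the first two lines tolerates blank lines between the two labels.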

coleslaw481 commented 7 months ago

Ran into an issue: if an attribute value contains newlines, it corrupts the hierarchy_node_attributes.tsv file, because each newline starts a new row. This happens because the LLM emits newlines in its response, as seen in this example output:

name    represents      CD_MemberList   CD_MemberList_Size      CD_MemberList_LogSize   CD_AnnotatedMembers     CD_AnnotatedMembers_Size        CD_AnnotatedMembers_Overlap     CD_AnnotatedMembers_Pvalue        HiDeF_persistence       CD_CommunityName        CD_Labeled      HCX::isRoot     HCX::members    CORUM_terms     CORUM_FDRs      CORUM_jaccard_indexes     CORUM_overlap_genes     GO_CC_terms     GO_CC_descriptions      GO_CC_FDRs      GO_CC_jaccard_indexes   GO_CC_overlap_genes     GO_CC_max_jaccard_index HPA_terms       HPA_FDRs  HPA_jaccard_indexes     HPA_overlap_genes       ollama_llama2:latest::_process  ollama_llama2:latest::_confidence       ollama_llama2:latest::_raw      
C22     C22     BAZ1B CHD1 MORF4L1 CHD4 DNMT3A USP7 HDAC2 SMARCA4 AURKB DNMT1 KMT2C PHF6 BRD7 CBX3 KMT2D HDAC1 EHMT1 YWHAE YWHAB ING1 PRMT1 PPP1CB      22      4.459           0         0.0     0.0     93      C22     True    True    [7, 18, 13, 4, 20, 14, 17, 10, 9, 15, 11, 8, 3, 16, 5, 0, 6, 19, 21, 2, 1, 12]                                  GO:0043231|GO:0043227|GO:0043226|GO:0043229|GO:0005622|GO:0110165|GO:0005575      intracellular membrane-bounded organelle|membrane-bounded organelle|organelle|intracellular organelle|intracellular anatomical structure|cellular anatomical entity|cellular_component    1.00e+00|1.00e+00|1.00e+00|1.00e+00|1.00e+00|1.00e+00|1.00e+00  1.0|1.0|1.0|1.0|1.0|1.0|1.0     CHD1,EHMT1,YWHAE,PPP1CB,CHD4,DNMT1,CBX3,KMT2C,ING1,SMARCA4,DNMT3A,KMT2D,AURKB,BRD7,PHF6,USP7,MORF4L1,YWHAB,PRMT1,HDAC2,BAZ1B,HDAC1|CHD1,EHMT1,YWHAE,PPP1CB,CHD4,DNMT1,CBX3,KMT2C,ING1,SMARCA4,DNMT3A,KMT2D,AURKB,BRD7,PHF6,USP7,MORF4L1,YWHAB,PRMT1,HDAC2,BAZ1B,HDAC1|CHD1,EHMT1,YWHAE,PPP1CB,CHD4,DNMT1,CBX3,KMT2C,ING1,SMARCA4,DNMT3A,KMT2D,AURKB,BRD7,PHF6,USP7,MORF4L1,YWHAB,PRMT1,HDAC2,BAZ1B,HDAC1|CHD1,EHMT1,YWHAE,PPP1CB,CHD4,DNMT1,CBX3,KMT2C,ING1,SMARCA4,DNMT3A,KMT2D,AURKB,BRD7,PHF6,USP7,MORF4L1,YWHAB,PRMT1,HDAC2,BAZ1B,HDAC1|CHD1,EHMT1,YWHAE,PPP1CB,CHD4,DNMT1,CBX3,KMT2C,ING1,SMARCA4,DNMT3A,KMT2D,AURKB,BRD7,PHF6,USP7,MORF4L1,YWHAB,PRMT1,HDAC2,BAZ1B,HDAC1|CHD1,EHMT1,YWHAE,PPP1CB,CHD4,DNMT1,CBX3,KMT2C,ING1,SMARCA4,DNMT3A,KMT2D,AURKB,BRD7,PHF6,USP7,MORF4L1,YWHAB,PRMT1,HDAC2,BAZ1B,HDAC1|CHD1,EHMT1,YWHAE,PPP1CB,CHD4,DNMT1,CBX3,KMT2C,ING1,SMARCA4,DNMT3A,KMT2D,AURKB,BRD7,PHF6,USP7,MORF4L1,YWHAB,PRMT1,HDAC2,BAZ1B,HDAC1        1.0                                     Transcriptional elongation and termination      0.8     Process: Transcriptional elongation and termination

Confidence Score: 0.8

The system of interacting proteins includes a variety of transcription factors, chromatin modulators, and other regulatory proteins that work together to facilitate the transcription of genetic information into RNA molecules. BAZ1B, CHD1, MORF4L1, CHD4, DNMT3A, USP7, HDAC2, SMARCA4, AURKB, DNMT1, KMT2C, PPH6, BRD7, CBX3, KMT2D, HDAC1, EHMT1, YWHAE, YWHAB, ING1, and PRMT1 are some of the key players in this process. These proteins interact with each other and with DNA to regulate gene expression during transcription elongation and termination.

The proteins in this system are involved in multiple steps of transcription, including initiation, elongation, and termination. BAZ1B, CHD1, and MORF4L1 are transcriptional activators that bind to enhancer elements and promote the recruitment of RNA polymerase during transcription initiation. CHD4 is a chromatin remodeler that facilitates the accessibility of DNA to the transcriptional machinery during elongation, while USP7 is an ubiquitin ligase that degrades transcriptional regulators during termination. DNMT3A and DNMT1 are important for establishing DNA methylation patterns, which can affect gene expression throughout the cell.
C24     C24     USP7 HDAC2 YWHAB DNMT1 KMT2D MORF4L1 YWHAE PHF6 ING1    9       3.17            0       0.0     0.0     37      C24     True    False   [21, 2, 11, 17, 16, 3, 1, 6, 0]                                                                                                                   Transcriptional elongation and RNA polymerase II C-terminal domain modification.  0.95    Process: Transcriptional elongation and RNA polymerase II C-terminal domain modification.

The system of interacting proteins consists of USP7, HDAC2, YWHAB, DNMT1, KMT2D, MORF4L1, YWHAE, PHF6, and ING1. These proteins work together to facilitate transcriptional elongation and modify the C-terminal domain of RNA polymerase II (RNAPII). Transcriptional elongation is a critical biological process that involves the sequential addition of nucleotides to a growing RNA chain during gene expression. The C-terminal domain of RNAPII plays a crucial role in this process, as it interacts with transcription elongation factors and helps to stabilize the RNA polymerase complex.
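
One possible way to keep each node on a single row is to flatten line breaks and tabs in every attribute value before writing. The sketch below is not the fix applied in the repository; the function names are hypothetical, and replacing line breaks with a space (rather than escaping them) is an assumption:

```python
import io
import re


def flatten_value(value):
    """Replace runs of line breaks and tabs with one space so the value fits in one cell."""
    return re.sub(r"[\r\n\t]+", " ", str(value)).strip()


def write_rows(rows, header, out):
    """Write dict rows as tab-separated lines, one line per row."""
    out.write("\t".join(header) + "\n")
    for row in rows:
        cells = [flatten_value(row.get(col, "")) for col in header]
        out.write("\t".join(cells) + "\n")


# Usage: a raw LLM response with blank lines stays on one TSV row.
buffer = io.StringIO()
write_rows(
    [{"name": "C22",
      "ollama_llama2:latest::_raw": "Process: X\n\nConfidence Score: 0.8\n\nText"}],
    ["name", "ollama_llama2:latest::_raw"],
    buffer,
)
```

Flattening loses the paragraph breaks of the raw response; escaping them instead would preserve them at the cost of readers having to unescape.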