Closed coleslaw481 closed 6 months ago
Ran into an issue: if an attribute value contains newlines, it corrupts the hierarchy_node_attributes.tsv file.
This happens because the LLM emits newlines in its response, which split a single row across several lines, as seen in this example output:
name represents CD_MemberList CD_MemberList_Size CD_MemberList_LogSize CD_AnnotatedMembers CD_AnnotatedMembers_Size CD_AnnotatedMembers_Overlap CD_AnnotatedMembers_Pvalue HiDeF_persistence CD_CommunityName CD_Labeled HCX::isRoot HCX::members CORUM_terms CORUM_FDRs CORUM_jaccard_indexes CORUM_overlap_genes GO_CC_terms GO_CC_descriptions GO_CC_FDRs GO_CC_jaccard_indexes GO_CC_overlap_genes GO_CC_max_jaccard_index HPA_terms HPA_FDRs HPA_jaccard_indexes HPA_overlap_genes ollama_llama2:latest::_process ollama_llama2:latest::_confidence ollama_llama2:latest::_raw
C22 C22 BAZ1B CHD1 MORF4L1 CHD4 DNMT3A USP7 HDAC2 SMARCA4 AURKB DNMT1 KMT2C PHF6 BRD7 CBX3 KMT2D HDAC1 EHMT1 YWHAE YWHAB ING1 PRMT1 PPP1CB 22 4.459 0 0.0 0.0 93 C22 True True [7, 18, 13, 4, 20, 14, 17, 10, 9, 15, 11, 8, 3, 16, 5, 0, 6, 19, 21, 2, 1, 12] GO:0043231|GO:0043227|GO:0043226|GO:0043229|GO:0005622|GO:0110165|GO:0005575 intracellular membrane-bounded organelle|membrane-bounded organelle|organelle|intracellular organelle|intracellular anatomical structure|cellular anatomical entity|cellular_component 1.00e+00|1.00e+00|1.00e+00|1.00e+00|1.00e+00|1.00e+00|1.00e+00 1.0|1.0|1.0|1.0|1.0|1.0|1.0 CHD1,EHMT1,YWHAE,PPP1CB,CHD4,DNMT1,CBX3,KMT2C,ING1,SMARCA4,DNMT3A,KMT2D,AURKB,BRD7,PHF6,USP7,MORF4L1,YWHAB,PRMT1,HDAC2,BAZ1B,HDAC1|CHD1,EHMT1,YWHAE,PPP1CB,CHD4,DNMT1,CBX3,KMT2C,ING1,SMARCA4,DNMT3A,KMT2D,AURKB,BRD7,PHF6,USP7,MORF4L1,YWHAB,PRMT1,HDAC2,BAZ1B,HDAC1|CHD1,EHMT1,YWHAE,PPP1CB,CHD4,DNMT1,CBX3,KMT2C,ING1,SMARCA4,DNMT3A,KMT2D,AURKB,BRD7,PHF6,USP7,MORF4L1,YWHAB,PRMT1,HDAC2,BAZ1B,HDAC1|CHD1,EHMT1,YWHAE,PPP1CB,CHD4,DNMT1,CBX3,KMT2C,ING1,SMARCA4,DNMT3A,KMT2D,AURKB,BRD7,PHF6,USP7,MORF4L1,YWHAB,PRMT1,HDAC2,BAZ1B,HDAC1|CHD1,EHMT1,YWHAE,PPP1CB,CHD4,DNMT1,CBX3,KMT2C,ING1,SMARCA4,DNMT3A,KMT2D,AURKB,BRD7,PHF6,USP7,MORF4L1,YWHAB,PRMT1,HDAC2,BAZ1B,HDAC1|CHD1,EHMT1,YWHAE,PPP1CB,CHD4,DNMT1,CBX3,KMT2C,ING1,SMARCA4,DNMT3A,KMT2D,AURKB,BRD7,PHF6,USP7,MORF4L1,YWHAB,PRMT1,HDAC2,BAZ1B,HDAC1|CHD1,EHMT1,YWHAE,PPP1CB,CHD4,DNMT1,CBX3,KMT2C,ING1,SMARCA4,DNMT3A,KMT2D,AURKB,BRD7,PHF6,USP7,MORF4L1,YWHAB,PRMT1,HDAC2,BAZ1B,HDAC1 1.0 Transcriptional elongation and termination 0.8 Process: Transcriptional elongation and termination
Confidence Score: 0.8
The system of interacting proteins includes a variety of transcription factors, chromatin modulators, and other regulatory proteins that work together to facilitate the transcription of genetic information into RNA molecules. BAZ1B, CHD1, MORF4L1, CHD4, DNMT3A, USP7, HDAC2, SMARCA4, AURKB, DNMT1, KMT2C, PPH6, BRD7, CBX3, KMT2D, HDAC1, EHMT1, YWHAE, YWHAB, ING1, and PRMT1 are some of the key players in this process. These proteins interact with each other and with DNA to regulate gene expression during transcription elongation and termination.
The proteins in this system are involved in multiple steps of transcription, including initiation, elongation, and termination. BAZ1B, CHD1, and MORF4L1 are transcriptional activators that bind to enhancer elements and promote the recruitment of RNA polymerase during transcription initiation. CHD4 is a chromatin remodeler that facilitates the accessibility of DNA to the transcriptional machinery during elongation, while USP7 is an ubiquitin ligase that degrades transcriptional regulators during termination. DNMT3A and DNMT1 are important for establishing DNA methylation patterns, which can affect gene expression throughout the cell.
C24 C24 USP7 HDAC2 YWHAB DNMT1 KMT2D MORF4L1 YWHAE PHF6 ING1 9 3.17 0 0.0 0.0 37 C24 True False [21, 2, 11, 17, 16, 3, 1, 6, 0] Transcriptional elongation and RNA polymerase II C-terminal domain modification. 0.95 Process: Transcriptional elongation and RNA polymerase II C-terminal domain modification.
The system of interacting proteins consists of USP7, HDAC2, YWHAB, DNMT1, KMT2D, MORF4L1, YWHAE, PHF6, and ING1. These proteins work together to facilitate transcriptional elongation and modify the C-terminal domain of RNA polymerase II (RNAPII). Transcriptional elongation is a critical biological process that involves the sequential addition of nucleotides to a growing RNA chain during gene expression. The C-terminal domain of RNAPII plays a crucial role in this process, as it interacts with transcription elongation factors and helps to stabilize the RNA polymerase complex.
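One possible way to avoid the broken rows shown above is to replace newlines, carriage returns, and tabs in each attribute value with a space before the row is written. The sketch below is only an illustration: `sanitize_tsv_value` and `format_tsv_row` are hypothetical helper names, not functions from this tool, and flattening the raw LLM text onto one line is an assumption about what the fix should do.

```python
import re

# Characters that would break a TSV row: newlines, carriage returns, tabs.
_ROW_BREAKERS = re.compile(r'[\r\n\t]+')


def sanitize_tsv_value(value):
    """Replace each run of newlines/tabs in a string with a single space.

    Non-string values (numbers, booleans, lists) are returned unchanged.
    Hypothetical helper, not part of the tool.
    """
    if not isinstance(value, str):
        return value
    return _ROW_BREAKERS.sub(' ', value).strip()


def format_tsv_row(values):
    """Join sanitized values with tabs so every node stays on one line."""
    return '\t'.join(str(sanitize_tsv_value(v)) for v in values)


if __name__ == '__main__':
    raw = 'Process: Example process\nConfidence Score: 0.8\n\nFree text.'
    print(format_tsv_row(['C22', 0.8, raw]))
```

An alternative would be to write the file with a quoting-aware TSV writer, but readers that split rows on line breaks would still see a quoted multi-line value as several rows, so flattening the value is the more conservative option.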
Add support for running ollama models in this tool. There should be a command-line flag, --ollama_models, that takes one or more comma-delimited parameters of the format
<MODEL>:<PROMPT or PROMPT FILE>
and the prompt should include the following token, which will be replaced by the gene set: {GENE_SET}
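The flag format and token substitution described here could be handled roughly as follows. This is a minimal sketch under several assumptions: `parse_ollama_models` and `build_prompt` are hypothetical names, each entry is split at its first colon, a prompt is treated as a file only if a file with that path exists, and genes are joined with commas. Splitting at the first colon means a model name that itself carries a tag (such as the `llama2:latest` visible in the column names above) would be parsed incorrectly, and a prompt containing a comma would be split into two entries, so the real implementation may need a different delimiter rule.

```python
import os

GENE_SET_TOKEN = '{GENE_SET}'


def parse_ollama_models(raw_value):
    """Parse 'MODEL:PROMPT,MODEL:PROMPT' into a list of (model, prompt) tuples.

    Hypothetical helper. Splits each entry at the FIRST colon, so model
    names containing a colon are not supported by this sketch.
    """
    pairs = []
    for entry in raw_value.split(','):
        model, sep, prompt = entry.strip().partition(':')
        if not sep or not model or not prompt:
            raise ValueError(
                f'Expected <MODEL>:<PROMPT or PROMPT FILE>, got: {entry!r}')
        # Assumption: if the prompt names an existing file, read it.
        if os.path.isfile(prompt):
            with open(prompt, 'r') as f:
                prompt = f.read()
        pairs.append((model, prompt))
    return pairs


def build_prompt(prompt_template, genes):
    """Replace the {GENE_SET} token with the comma-joined gene names."""
    return prompt_template.replace(GENE_SET_TOKEN, ','.join(genes))


if __name__ == '__main__':
    for model, template in parse_ollama_models('llama2:Analyze {GENE_SET}'):
        print(model, '->', build_prompt(template, ['USP7', 'HDAC2']))
```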
In addition, the prompt should tell the LLM to output the following two lines at the top:
Example prompt already included in dev branch: