Word clouds based on member (gene) descriptions

krassowski commented 3 years ago

This is just an idea for an example I guess. Would it be possible to facilitate creating the word cloud based on the descriptions of pathway members (e.g. genes) rather than on the descriptions of the pathway itself?

Currently one can manually adjust the terms (which is great!) and for example use descriptions instead of terms. This is already an interesting alternative which might be worth documenting, as the example heatmap:

Becomes:

when using AnnotationDbi::select(GO.db::GO.db, keys = x, columns = "DEFINITION")$DEFINITION (which is not trivial to change; would it be a good idea to make the choice between TERM and DEFINITION a parameter?)

I would like to go a step further and, for each pathway, concatenate the descriptions of all genes that were included; the gene/protein descriptions could come from RefSeq, UniProt, or any of the ontology databases. My expectation would be that those are provided by an advanced user in the form of a named character vector, e.g.:

member_descriptions=c(
   'TP53'='This gene encodes a tumor suppressor protein containing transcriptional activation, DNA binding, and oligomerization domains. The encoded protein responds to diverse cellular stresses to regulate expression of target genes, thereby inducing cell cycle arrest, apoptosis, senescence, DNA repair, or changes in metabolism. Mutations in this gene are associated with a variety of human cancers, including hereditary cancers such as Li-Fraumeni syndrome. Alternative splicing of this gene and the use of alternate promoters result in multiple transcript variants and isoforms. Additional isoforms have also been shown to result from the use of alternate translation initiation codons from identical transcript variants (PMIDs: 12032546, 20937277). [provided by RefSeq, Dec 2016]',
   'BRCA1'='This gene encodes a 190 kD nuclear phosphoprotein that plays a role in maintaining genomic stability, and it also acts as a tumor suppressor. The BRCA1 gene contains 22 exons spanning about 110 kb of DNA. The encoded protein combines with other tumor suppressors, DNA damage sensors, and signal transducers to form a large multi-subunit protein complex known as the BRCA1-associated genome surveillance complex (BASC). This gene product associates with RNA polymerase II, and through the C-terminal domain, also interacts with histone deacetylase complexes. This protein thus plays a role in transcription, DNA repair of double-stranded breaks, and recombination. Mutations in this gene are responsible for approximately 40% of inherited breast cancers and more than 80% of inherited breast and ovarian cancers. Alternative splicing plays a role in modulating the subcellular localization and physiological function of this gene. Many alternatively spliced transcript variants, some of which are disease-associated mutations, have been described for this gene, but the full-length natures of only some of these variants has been described. A related pseudogene, which is also located on chromosome 17, has been identified. [provided by RefSeq, May 2020]'
)

simplifyEnrichment would then take responsibility for concatenating them, creating one document per pathway. This could be just a helper function exposed to users, who would need to pass the result as the term argument to anno_word_cloud. A special case could be made for anno_word_cloud_from_GO(), where this would be handled for the user if they ask for it.
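A minimal sketch of what such a helper could look like, in base R only. The function name `concat_member_descriptions` and the `pathway_members` argument (a named list mapping each pathway ID to its member genes) are hypothetical and not part of simplifyEnrichment; the descriptions are shortened placeholders.

```r
# Hypothetical helper (not an existing simplifyEnrichment function):
# build one text document per pathway by concatenating the descriptions
# of its members. Members without a description are silently skipped.
concat_member_descriptions <- function(pathway_members, member_descriptions) {
  vapply(pathway_members, function(genes) {
    known <- intersect(genes, names(member_descriptions))
    paste(member_descriptions[known], collapse = " ")
  }, character(1))
}

# Assumed input shape: a named list, pathway ID -> member gene symbols.
pathway_members <- list(
  "GO:0006281" = c("TP53", "BRCA1"),
  "GO:0006915" = c("TP53", "UNKNOWN_GENE")
)

# Shortened placeholder descriptions, named by gene symbol.
member_descriptions <- c(
  "TP53"  = "This gene encodes a tumor suppressor protein.",
  "BRCA1" = "This gene encodes a nuclear phosphoprotein."
)

docs <- concat_member_descriptions(pathway_members, member_descriptions)
print(docs)
```

The result is a character vector named by pathway ID, which would still need to be arranged into whatever shape the term argument of anno_word_cloud expects.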

krassowski commented 3 years ago

By the way, a concatenation of the DEFINITION and TERM columns might be interesting too. I would expect the most important words to be repeated in both the term name and the definition, which could give much better results.
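A rough sketch of that concatenation, assuming a data frame with GOID, TERM and DEFINITION columns such as AnnotationDbi::select() would return when asked for both columns. The data frame below is built by hand with placeholder definitions so the snippet runs without GO.db; `go_info` and `term_and_definition` are illustrative names.

```r
# Stand-in for the result of something like
#   AnnotationDbi::select(GO.db::GO.db, keys = x,
#                         columns = c("TERM", "DEFINITION"))
# The DEFINITION values here are placeholders, not the real GO definitions.
go_info <- data.frame(
  GOID = c("GO:0006281", "GO:0006915"),
  TERM = c("DNA repair", "apoptotic process"),
  DEFINITION = c("Placeholder definition mentioning DNA repair.", NA),
  stringsAsFactors = FALSE
)

# Treat a missing definition as empty text so paste() does not emit "NA".
definition <- ifelse(is.na(go_info$DEFINITION), "", go_info$DEFINITION)

# One document per GO ID: term name followed by its definition.
term_and_definition <- setNames(
  trimws(paste(go_info$TERM, definition)),
  go_info$GOID
)
print(term_and_definition)
```

Words that occur in both the term name and the definition are counted twice in the resulting document, which is the effect described above.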

Sorry about the deluge of issues @jokergoo. This is the last one for today; I will keep myself busy with other things now. Please let me know if you would like me to help address the suggestions I added, or whether you prefer to make decisions and code on your own. In either case, thanks for your awesome work!

jokergoo commented 3 years ago

That is totally fine, and thank you for all your comments and suggestions! I will look into them in the next few days!

jokergoo commented 3 years ago

Using gene descriptions/summaries to construct the word cloud is a great idea! I would like to support it in the package. It seems that currently no annotation package provides gene description information (only gene names). It also seems that only the RefSeq database provides such information, so I will collect it manually.

jokergoo commented 3 years ago

It seems we need to perform some word analysis if we use RefSeq gene descriptions for the word cloud. Some words need to be put on a blacklist (e.g. gene, encode, family, ...)
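A minimal sketch of such filtering in base R, assuming a simple lowercase split on non-alphanumeric characters. `count_words` is a hypothetical name, and the blacklist below is only illustrative: since no stemming is done, inflected forms such as "encodes" have to be listed explicitly.

```r
# Hypothetical helper: count word frequencies in a text after dropping
# blacklisted words. Tokenization is a plain split on anything that is
# not a lowercase letter or digit; there is no stemming.
count_words <- function(text, blacklist = character(0)) {
  words <- unlist(strsplit(tolower(text), "[^a-z0-9]+"))
  words <- words[nzchar(words) & !(words %in% blacklist)]
  sort(table(words), decreasing = TRUE)
}

# Illustrative blacklist: the words named above plus a few obvious
# companions (the extra entries are assumptions).
blacklist <- c("gene", "encode", "encodes", "family", "this", "a", "the", "as")

text <- "This gene encodes a tumor suppressor; the gene family acts as a tumor suppressor."
freq <- count_words(text, blacklist)
print(freq)
```

In practice the blacklist would probably be derived from word frequencies across all RefSeq descriptions rather than written by hand.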

jokergoo / simplifyEnrichment

Word clouds based on member (gene) descriptions #56