soymine keyword search not returning data for all genomes?

adf-ncgr commented 2 years ago

The given example gene Glyma.01G086900 works just fine as a "keyword" but Glysoja.01G000001 does not (though it is present in the database). Probably related to this is the fact that the "Hits by Organism" facet in the keyword results does not seem to list any species but G. max (based on my limited testing).

sammyjava commented 2 years ago

Must be an old keyword index. Running create-search-index now to see if we get more stuff in. There isn't anything in the mine config that favors one organism over another.

adf-ncgr commented 2 years ago

sounds good.

sammyjava commented 2 years ago

This is a bear. I'm having difficulty getting create-search-index to complete without a connection drop, which crashes the process. I'm running it locally on prod-solr now, which I've done before (probably with soymine). I think the configuration for partial text matches (which I expanded) really bogs it down. May need to limit how many attributes we do that on.

sammyjava commented 2 years ago

So, I managed to build the soymine index (in only 1:25) on shokin-mines with all the partial keyword matches removed. I couldn't do it on prod-solr (crashed after 55m) -- which appears to be a memory issue. shokin-mines has 80G of vRAM. So now I'm thinking of using haldane to index with partial matches (wow an actual LIS use of haldane!). This has arisen because soymine has gotten to be a bigass mine: 1249759 genes, 2010497 proteins. I know Joe has had struggles with solr on his giant mine. Until I removed the partial matches I couldn't get a build on shokin-mines. I'll leave this open as Research Continues.

sammyjava commented 2 years ago

| haldane Xmx | prod-solr Xmx | partial_match attributes | time (h:mm) |
|---|---|---|---|
| 240g | 512m | --- | 1:30 |
| 240g | 512m | gene_secondaryidentifier | 1:25 |
| 240g | 512m | + gene_description | 1:30 |
| 240g | 512m | + ontologyterm_name, ontologyterm_description, genefamily_description* | 1:33 |
| 240g | 512m | + dataset_, strain_, organism_ | 6:32! |
| 240g | 12g | repeat | 6:41! |
| 240g | 12g | - dataset_, strain_, organism_ | 1:28 |
| 240g | 12g | + strain_ | 1:27 |
| 240g | 12g | + organism_ | 6:40! |
| 240g | 12g | - organism_ + dataset_ | IOException @ 0:40 |
| 240g | 12g | repeat | 1:29 |
| 240g | 12g | + organism_, - index.references.BioEntity = organism | 0:54 |
| 240g | 12g | full monty (see list in issue below) | 2:28 |

*switched all text_ngram fields from stored="false" to stored="true" for this round
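
For reference, the cost of each step in the table can be read off by subtracting the previous row's time. The snippet below is only a small arithmetic sketch: it assumes the time column is h:mm and that the trailing "!" is just emphasis. The helper names are made up for illustration.

```python
def to_minutes(hmm):
    """Convert an 'h:mm' string to minutes, ignoring a trailing '!'."""
    hours, minutes = hmm.rstrip("!").split(":")
    return int(hours) * 60 + int(minutes)


def fmt(minutes):
    """Format a number of minutes back into h:mm."""
    return f"{minutes // 60}:{minutes % 60:02d}"


# Row with ontologyterm/genefamily added vs. the next row, which adds
# the dataset, strain and organism attributes in one go.
before_batch = to_minutes("1:33")
after_batch = to_minutes("6:32!")
batch_cost = after_batch - before_batch

# Row with only strain added back vs. the next row, which adds organism.
with_strain = to_minutes("1:27")
with_organism = to_minutes("6:40!")
organism_cost = with_organism - with_strain

print("dataset+strain+organism batch:", fmt(batch_cost))
print("organism alone:", fmt(organism_cost))
```

Both differences come out at roughly five hours, which is consistent with a single attribute group accounting for nearly all of the slowdown.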

adf-ncgr commented 2 years ago

I was going to comment earlier that you are becoming the Joe Carlson of legumes. Just so I'm clear on what's at stake here with partial matching, is it basically enabling provided keywords to match if they are substrings of words in the indexed text? e.g. user asks for SRG and it matches SRG1 only if partial matching has been enabled in the indexing? If this is so I'm not sure how important it would be for secondary identifier to allow partial matching (despite use cases like 0010100 that you had used in your graphql demo); and there's probably no reason to partial match primary identifier since the only sensible partial match I can think of there would be returned by a full match against secondary identifier (ie the yuckless substring). Description strings would be good cases for partial matching, though.
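
To make the SRG / SRG1 question concrete, here is a toy sketch of partial matching. It is not Solr and not the InterMine indexer: it assumes that a partial-match field is indexed by breaking each word into n-grams (substrings), and the gram sizes of 2 to 15 are an arbitrary assumption, not the values in the soymine schema. The documents and function names are invented for illustration.

```python
def ngrams(token, min_n=2, max_n=15):
    """Return every substring of token whose length is in [min_n, max_n]."""
    token = token.lower()
    grams = set()
    for n in range(min_n, min(max_n, len(token)) + 1):
        for i in range(len(token) - n + 1):
            grams.add(token[i:i + n])
    return grams


def build_index(docs, partial):
    """Map each indexed term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            # With partial matching every n-gram of the word becomes a term;
            # without it only the whole word is indexed.
            terms = (ngrams(word) | {word}) if partial else {word}
            for term in terms:
                index.setdefault(term, set()).add(doc_id)
    return index


def search(index, keyword):
    """Look the keyword up as a single exact term in the index."""
    return sorted(index.get(keyword.lower(), set()))


# Hypothetical documents, for illustration only.
docs = {"gene1": "SRG1 senescence-related gene", "gene2": "unrelated kinase"}
exact_index = build_index(docs, partial=False)
partial_index = build_index(docs, partial=True)

print(search(exact_index, "SRG"))    # no hit: only whole words are indexed
print(search(partial_index, "SRG"))  # hit: "srg" is an n-gram of "srg1"
print(len(exact_index), len(partial_index))
```

The last line also shows why partial matching is expensive: the n-gram index holds many more terms than the whole-word index for the same two documents.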

sammyjava commented 2 years ago

Agreed, I can be much more judicious in my choice of attributes to partial match. [Edited out my repeat of what you said.] I'm doing a slow course of tests. Another option, BTW, is to build the index entirely on haldane (run a solr while indexing) and copy the index over to prod-solr. You don't need a lot of resources to serve queries but you need them to index. But I'll figure that out in my grid-o-variables.

sammyjava commented 2 years ago

OK this is completely bizarre. The addition of dataset_name/synopsis/description, organism_name/description, and strain_identifier/name/description to partial string indexing adds 5 hours to indexing! That's not much data, there are only 443 datasets, most of which have a description like "Further information provided in 10.1007/s00122-003-1449-z" from the soybase import. I'll add 'em back one by one and see if there's a smoking gun. Or at least a smoking jacket.

sammyjava commented 2 years ago

Smoking gun = organism. Theory:

index.references.BioEntity = synonyms crossReferences organism

I think that since the organism reference for every BioEntity is to be indexed, with organism partial matching it runs a partial match calculation on organism for EVERY BioEntity. Will test by commenting out that directive. We don't use synonyms and cross-references anyway, and we don't want every glyma BioEntity to be returned for a search on "max".
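
A back-of-the-envelope sketch of why this theory would explain the slowdown, if it holds. It assumes n-gram tokenization with gram sizes 2 to 15 (an assumption, not the real schema setting), uses "Glycine max" as a stand-in organism name, and counts only the genes and proteins quoted earlier in this thread, so it is a lower bound on the number of BioEntity records. It illustrates the scale and is not a measurement.

```python
def gram_count(text, min_n=2, max_n=15):
    """Count the n-grams generated for each whitespace-separated word."""
    total = 0
    for word in text.split():
        for n in range(min_n, min(max_n, len(word)) + 1):
            total += len(word) - n + 1
    return total


# Counts quoted earlier in the thread for soymine.
GENES = 1_249_759
PROTEINS = 2_010_497
bio_entities = GENES + PROTEINS

# Stand-in organism name; the real organism fields may be longer.
grams_per_name = gram_count("Glycine max")

# Organism indexed once, on its own record.
indexed_once = grams_per_name
# Organism re-indexed through the reference on every BioEntity.
indexed_per_entity = grams_per_name * bio_entities

print(grams_per_name, indexed_once, indexed_per_entity)
```

Under these assumptions a couple of dozen n-grams turn into tens of millions once the organism is pulled in through every BioEntity, and every one of those entities would also match a search on "max".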

sammyjava commented 2 years ago

OK I'm done, clearly indexing BioEntity.organism is a Bad Idea. Glad I figured this out since I've struggled with indexing for ages. I'll add in more judicious choices of partial matches on names and descriptions. The search results may be a bit quicker as well, not sure. I did increase the RAM on the solr process as well, since it was silly I wasn't using the available RAM on a dedicated VM.

sammyjava commented 2 years ago

For the record, here are the "final" managed-schema text_ngram entries for soymine:

  <field name="author_name" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="cds_secondaryidentifier" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="cdsregion_secondaryidentifier" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="chromosome_secondaryidentifier" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="dataset_description" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="dataset_name" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="dataset_synopsis" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="expressionsample_description" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="expressionsample_name" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="expressionsource_description" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="expressionsource_synopsis" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="gene_description" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="gene_secondaryidentifier" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="genefamily_description" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="geneticmap_description" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="geneticmap_primaryidentifier" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="geneticmap_synopsis" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="goterm_description" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="goterm_name" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="gwas_description" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="gwas_primaryidentifier" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="gwas_synopsis" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="mrna_secondaryidentifier" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="ontologyterm_description" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="ontologyterm_name" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="organism_description" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="organism_name" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="pathway_name" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="protein_secondaryidentifier" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="qtl_primaryidentifier" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="qtlstudy_description" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="qtlstudy_primaryidentifier" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="qtlstudy_synopsis" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="strain_description" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="strain_identifier" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="strain_name" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="trait_primaryidentifier" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
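
A list like the one above can be pulled out of a managed-schema file rather than maintained by hand. The sketch below uses Python's standard `xml.etree.ElementTree`. The embedded sample is a cut-down, hypothetical schema (the schema name and the `string` field are invented) so that the snippet runs on its own; for a real check you would read the actual managed-schema file instead.

```python
import xml.etree.ElementTree as ET

# Cut-down, hypothetical schema for illustration; not the real soymine file.
SAMPLE_SCHEMA = """<schema name="example-search" version="1.6">
  <field name="gene_description" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="organism_name" type="text_ngram" multiValued="true" indexed="true" required="false" stored="true"/>
  <field name="gene_primaryidentifier" type="string" indexed="true" stored="true"/>
</schema>"""


def fields_of_type(schema_xml, field_type="text_ngram"):
    """Return the sorted names of all <field> elements of the given type."""
    root = ET.fromstring(schema_xml)
    return sorted(
        field.get("name")
        for field in root.iter("field")
        if field.get("type") == field_type
    )


ngram_fields = fields_of_type(SAMPLE_SCHEMA)
for name in ngram_fields:
    print(name)
```

Grouping the names by their prefix (`gene_`, `organism_`, and so on) would make it easy to spot an attribute group that should not be partial matched.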

legumeinfo / mine-issues

soymine keyword search not returning data for all genomes? #63