opencb / cellbase

High-Performance NoSQL database and RESTful web services to access to most relevant biological data
Apache License 2.0

Transcripts end point broken when using a replica set #557

Open julie-sullivan opened 3 years ago

julie-sullivan commented 3 years ago

CellBase (for transcripts only) sorts after pagination. The sort must be applied before SKIP and LIMIT. Without an explicit sort ahead of pagination, the order of the documents being paged is not guaranteed, so when a replica set is present the query results will be incorrect.
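A minimal sketch in plain Python (no MongoDB involved) of why the order of operations matters. It assumes, for illustration only, that two replica set members return the same unsorted documents in different orders; the function names and the sample ids are hypothetical and not part of CellBase.

```python
def page_then_sort(docs, skip, limit):
    """Buggy order: take the page first, then sort only that page."""
    return sorted(docs[skip:skip + limit])


def sort_then_page(docs, skip, limit):
    """Intended order: sort the full result set, then take the page."""
    return sorted(docs)[skip:skip + limit]


# Hypothetical transcript ids.
ids = [f"ENST{n:05d}" for n in range(10)]

# Assumption: two replicas hold the same documents but return them
# in different orders when no sort is applied before pagination.
replica_a = list(ids)
replica_b = list(reversed(ids))

# Paging before sorting yields a different "first page" per replica.
print(page_then_sort(replica_a, 0, 3))
print(page_then_sort(replica_b, 0, 3))

# Sorting before paging yields the same page regardless of replica.
print(sort_then_page(replica_a, 0, 3))
print(sort_then_page(replica_b, 0, 3))
```

The first two calls disagree because each page is cut from a differently ordered list; the last two agree because the cut happens on an already sorted list.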

In the database adapter I did this:

Bson sort = MongoDBQueryUtils.getSort(options); // get SORT from the query
options.remove(QueryOptions.SORT); // remove SORT so we don't sort twice

aggregateList.add(match);
aggregateList.add(sort); // add SORT here, right after the gene match
aggregateList.add(unwind);
aggregateList.add(match2);
aggregateList.add(excludeAndInclude);
aggregateList.add(project);

I also tried sorting after the projection. The output files were the same, but they were truncated because the SORT failed:

Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting.

You can opt in to external sorting: https://docs.mongodb.com/manual/reference/command/aggregate/#std-label-aggregate-cmd-allowDiskUse
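For illustration, here is a rough Python equivalent of the pipeline ordering and of the opt-in to external sorting. It is a sketch, not the CellBase adapter code: `build_pipeline` and `run_with_disk_use` are hypothetical helpers, the `$unwind` path and the match fields are assumptions based on the field names mentioned in this thread, and `collection` is assumed to be a PyMongo `Collection`, whose `aggregate` method accepts `allowDiskUse`.

```python
def build_pipeline(biotypes, sort_field="id"):
    """Assemble aggregation stages as plain dicts, sorting right after the first match.

    The stage contents are assumptions for illustration only.
    """
    return [
        {"$match": {"transcripts.biotype": {"$in": biotypes}}},
        {"$sort": {sort_field: 1}},       # sort early, before any pagination
        {"$unwind": "$transcripts"},
        {"$match": {"transcripts.biotype": {"$in": biotypes}}},
    ]


def run_with_disk_use(collection, pipeline):
    """Run the pipeline, allowing large sort stages to spill to disk."""
    # allowDiskUse=True opts in to external sorting, which avoids the
    # "Sort exceeded memory limit" failure for large result sets.
    return list(collection.aggregate(pipeline, allowDiskUse=True))
```

Whether external sorting is acceptable for a web service is a separate trade-off: it avoids the failure, but a sort that spills to disk is slower than one served from an index.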

Going to test this.

julie-sullivan commented 3 years ago
cellbase_transcript_client.search(
    biotype=self._relevant_biotypes,
    include=self.CELLBASE_TRANSCRIPT_QUERY_INCLUDE,
    assembly=self._assembly,
    annotationFlags=InterpretationProcess.GENE_CODE_BASIC_TRANSCRIPT_SET,
    sort='id',
)

If you run this script more than once, you will get different results.

julie-sullivan commented 3 years ago

For the above to work, you need an additional index: {"transcripts.biotype":1, id:1}
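A sketch of creating that index from Python, assuming `collection` is a PyMongo `Collection` for the gene data; the helper name `ensure_transcript_sort_index` is hypothetical. The key specification mirrors the one given above.

```python
# Compound key from the comment above: transcripts.biotype ascending, then id ascending.
INDEX_KEYS = [("transcripts.biotype", 1), ("id", 1)]


def ensure_transcript_sort_index(collection):
    """Create the compound index and return whatever create_index returns."""
    # `collection` is assumed to expose PyMongo's create_index(keys) method.
    return collection.create_index(INDEX_KEYS)
```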