API server of the cs-insights project. It is the main component for storing data and for accessing an external data analysis endpoint. It stores everything in a MongoDB instance and queries the cs-insights-prediction-endpoint for machine learning results.
Is your feature request related to a problem? Please describe.
Once https://github.com/ag-gipp/NLP-Land-backend/issues/24 is solved, some queries might not perform well anymore. The /info, /quartiles, and topk endpoints take very long to respond for authors (around 1 minute). The cause is that MongoDB has to:
$unwind 5 million papers, into 10+ million items
$group those into 2.7 million groups
$sort those 2.7 million authors (without index)
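The three stages above can be sketched as an aggregation pipeline. This is only an illustration, not the backend's actual query: the field name `authors`, the counter field `papers`, and both function names are assumptions. The pure-Python `simulate_topk` mirrors what the stages do, so the fan-out can be tried without a database.

```python
from collections import Counter


def author_topk_pipeline(k: int) -> list[dict]:
    """Hypothetical pipeline containing the three costly stages."""
    return [
        # One output document per (paper, author) pair.
        {"$unwind": "$authors"},
        # One group per distinct author, counting that author's papers.
        {"$group": {"_id": "$authors", "papers": {"$sum": 1}}},
        # Sorts on a field computed by $group, so no collection index applies.
        {"$sort": {"papers": -1}},
        {"$limit": k},
    ]


def simulate_topk(papers: list[dict], k: int) -> list[tuple[str, int]]:
    """In-memory equivalent of the pipeline, for illustration only."""
    unwound = [author for paper in papers for author in paper["authors"]]  # $unwind
    counts = Counter(unwound)                                              # $group
    # $sort + $limit; ties are broken by author name to keep the result stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:k]
```

Every stage has to process the full output of the previous one before `$limit` can discard anything, which is why the cost grows with the total number of paper-author pairs rather than with `k`.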
Describe the solution you'd like
Optimize all queries that do not perform well and fix any workarounds.
[x] Fix the paged endpoint: it has issues with the $lookup/$sort/$project stages in the pipeline, so we changed the order of the stages as a workaround. This returns incorrect results when we sort by venue or authors. Originally the $project stage came before the $sort stage. (Done, but there is a new issue.)
[x] Change the schema so that all information is duplicated into each author. This ensures that all filters can be applied to each author without $unwind/$group or $lookup.
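A minimal sketch of what the schema change could look like, assuming hypothetical paper fields such as `title` and `year` and a per-author `papers` array (the real schema may differ). Each author document carries a copy of the paper information, so a filter can be evaluated against the author directly.

```python
def denormalize(papers: list[dict]) -> list[dict]:
    """Build one document per author, duplicating paper info into each."""
    authors: dict[str, dict] = {}
    for paper in papers:
        # Copy every paper field except the author list itself.
        info = {key: value for key, value in paper.items() if key != "authors"}
        for name in paper["authors"]:
            doc = authors.setdefault(name, {"_id": name, "papers": []})
            doc["papers"].append(info)
    return list(authors.values())


def count_papers(author_doc: dict, min_year: int) -> int:
    """Apply a year filter to a single author document, with no regrouping."""
    return sum(1 for paper in author_doc["papers"] if paper["year"] >= min_year)
```

The trade-off is more storage and more work on writes, since a paper with several authors is stored once per author, in exchange for reads that no longer need to regroup the whole collection.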
Describe alternatives you've considered
Should the queries without filters still take too long, we could add some default values for filters.