Closed: marcosmro closed this issue 3 years ago
I've noticed that some CDE fields are duplicated in our Elasticsearch index (e.g., @id = https://repo.metadatacenter.org/template-fields/01a811c6-cf28-4a47-bccf-727191c5602a), while others are missing. I'm trying to find the cause of this issue.
It seems that the issue with missing entries was caused by the disk running low on space. When that happens, Elasticsearch puts its indices into read-only mode and CEDAR logs the following message:
WARN [2021-06-15 10:13:52,866] org.metadatacenter.cedar.util.dw.CedarCedarExceptionMapper: :CCEM:msg :blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
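For reference, the diagnosis can be confirmed with a couple of requests against the cluster (a minimal sketch, run from the Kibana Dev Tools console):

# Disk usage per node; low free space is what triggers the flood-stage watermark
GET /_cat/allocation?v

# Per-index value of the read-only block setting (shows "true" on blocked indices)
GET /_all/_settings/index.blocks.read_only_allow_delete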
We fixed the missing entries by truncating the log_cypher table and clearing the read_only_allow_delete block in Elasticsearch as follows:
PUT /_all/_settings
{
  "index.blocks.read_only_allow_delete": null
}
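After clearing the block, it is worth verifying that it is gone; an illustrative check (not part of the original fix):

GET /_all/_settings/index.blocks.read_only_allow_delete

Note that Elasticsearch will re-apply the block as soon as the flood-stage disk watermark is exceeded again, so clearing it only lasts while enough disk space is available.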
Our short-term plan to address the disk space issue is described at https://github.com/metadatacenter/cedar-project/issues/1134
Here is a brief description of the issue that caused duplicated index entries and the approach used to fix it:
The caDSR CDEs ingestion tool makes use of two different endpoints to upload CDEs to CEDAR: E1, which creates the CDE (and its corresponding index entry), and E2, which attaches the CDE to its categories (and updates the index entry accordingly).
This issue was related to the index update performed in E2. Now, focusing on the Elasticsearch index, here are the actions taken to create a CDE:
a. Create a CDE index entry (without any associated categories) (done by E1).
b. Find the existing index entry for the CDE (search by CEDAR id) (done by E2).
c. Delete the existing index entry for the CDE using the index identifier (done by E2).
d. Create a new index entry for the CDE, including the category identifiers (done by E2).
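As an illustration of steps b) to d) only, the Elasticsearch requests look roughly like this. The index name cedar-search, the document id OLD_DOCUMENT_ID, and the field names are hypothetical placeholders, not CEDAR's actual schema:

# b) Find the existing index entry for the CDE by its CEDAR id
#    (assumes @id is indexed as a keyword field)
GET /cedar-search/_search
{
  "query": {
    "term": { "@id": "https://repo.metadatacenter.org/template-fields/01a811c6-cf28-4a47-bccf-727191c5602a" }
  }
}

# c) Delete that entry using the _id returned by the search above (placeholder value)
DELETE /cedar-search/_doc/OLD_DOCUMENT_ID

# d) Create a new entry that also includes the category identifiers (illustrative fields)
POST /cedar-search/_doc
{
  "@id": "https://repo.metadatacenter.org/template-fields/01a811c6-cf28-4a47-bccf-727191c5602a",
  "categories": ["category-id-1", "category-id-2"]
}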
The index ends up with duplicate index entries for CDEs when there is not enough time between a) and b). These steps happen sequentially but, in Elasticsearch, there is a delay (by default, 1 second) between the time a document is created and the time it becomes visible (searchable). Therefore, when b) happens less than 1 second after a), the index entry associated with the CDE won't be found and therefore won't be deleted. Consequently, after E2, the index will contain two indexed documents for a given CDE: one without the categories and one with the categories.
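A minimal sketch of that timing window, again using the hypothetical cedar-search index:

# a) Index the CDE entry (without categories)
PUT /cedar-search/_doc/cde-1
{
  "@id": "https://repo.metadatacenter.org/template-fields/01a811c6-cf28-4a47-bccf-727191c5602a"
}

# b) If this runs within the default 1-second refresh interval, it can return count = 0
#    even though the document above was indexed successfully
GET /cedar-search/_count
{
  "query": {
    "term": { "@id": "https://repo.metadatacenter.org/template-fields/01a811c6-cf28-4a47-bccf-727191c5602a" }
  }
}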
There are ways in Elasticsearch both to force an index refresh and to decrease the refresh interval, but refreshing an index consumes considerable resources, so neither action is recommended.
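For completeness, these are the two options referred to above, shown against the hypothetical cedar-search index (neither was adopted):

# Force an immediate refresh so newly indexed documents become searchable right away
POST /cedar-search/_refresh

# Lower the refresh interval for the index (increases indexing overhead)
PUT /cedar-search/_settings
{
  "index": { "refresh_interval": "200ms" }
}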
The approach used to solve this issue is twofold:
An obvious alternative to the described approach would be to develop a new endpoint that does all the work of E1 and E2 in just one call, that is, creates the CDE, associates it to its categories, and creates the corresponding index entry once, at the end of the process.