Combine synchronized storing and indexing, re-organize components, HAL support (no more DOI centric approach), dependency update, import format update

kermitt2 commented 7 months ago

This is a working PR for version 0.3, which introduces many changes:

storing (on LMDB) and indexing (on ElasticSearch cluster) are now done in one step (no more Node.js separate indexing), in an asynchronous manner. This solves supporting bibliographical records beyond DOI as the record naming identifier is centralized. Storing and indexing are done in parallel and quite optimized (load time should be relatively similar to the previous load time, but we have the indexing done in addition)
only one Java application for everything (removal of the Node.js part, which was fun to write, but ultimately a bad idea as well noted by karatekaneen :D )
update to Dropwizard 4, change to logback, update of various other dependencies
support of ElasticSearch 8.* and the latest Java client API for indexing (still the Java High Level REST Client for retrieval for the moment)
update of file import formats for the CrossRef dump flavors (using #83 from @lfoppiano)
support of HAL archive, loading via the HAL web API (10 times faster than OAI-PMH), query integrating HAL ID and results integrating full HAL records (with or without DOI)

Todo:

Solr cluster support (ongoing), as a more open search framework alternative... it will likely replace entirely ElasticSearch given its new license
load and index PubMed full records (via Medline dumps), the parsing and JSON conversion are done, only need a proper command and a lookup class. This will support all the PubMed entries without DOI for the reference matching (around 8M records)

karatekaneen commented 5 months ago

Would love to help to get this completed. What's left to do before this can be merged?

kermitt2 commented 5 months ago

Hi @karatekaneen ! I hope you're doing well. It's actually complete - except what is in the todo list, but that would be too much for this PR. I was waiting for some feedback from a user, but it's fully functional according to my tests. I will try to merge it next week-end after quickly reviewing the documentation.

karatekaneen commented 5 months ago

Lovely! I'll take it for a test spin as soon as I get the chance. Haven't worked with Solr before so might be a bit tricky to set up for me

karatekaneen commented 5 months ago

The tests seems to be broken which makes the gradlew clean build command fail. When running gradlew clean jar instead to skip the tests it seems to work and the server starts up. Waiting for the data to download and then I'll try to get everything up and running.

Here's some of the output from the failed step:

> Task :compileTestJava FAILED
/srv/glutton/src/test/java/com/scienceminer/lookup/reader/UnpaidWallReaderTest.java:3: error: package com.scienceminer.lookup.data does not exist
import com.scienceminer.lookup.data.UnpayWallMetadata;
                                   ^
/srv/glutton/src/test/java/com/scienceminer/lookup/reader/UnpaidWallReaderTest.java:14: error: cannot find symbol
    UnpayWallReader target;
    ^
  symbol:   class UnpayWallReader
  location: class UnpaidWallReaderTest
/srv/glutton/src/test/java/com/scienceminer/lookup/reader/PmidReaderTest.java:13: error: cannot find symbol
    PmidReader target;
    ^
  symbol:   class PmidReader
  location: class PmidReaderTest
/srv/glutton/src/test/java/com/scienceminer/lookup/reader/IstexIdsReaderTest.java:3: error: package com.scienceminer.lookup.data does not exist
import com.scienceminer.lookup.data.IstexData;

kermitt2 commented 5 months ago

Sorry I forgot working on the tests! They have been updated.

karatekaneen commented 5 months ago

@kermitt2 Tried it out and it works perfectly. Haven't tried the HAL stuff though since it's of no interest for us. Awesome work!

The only thing I've noticed is that the indexing was quite slow. I used a VM with 2cpu + 6gb RAM and something similar for the Elasticsearch instance. None of them ran at max capacity on cpu or memory but the indexing took 80hrs (~300 items/sec). Not a big deal since it's done now but worth mentioning.

kermitt2 commented 5 months ago

The only thing I've noticed is that the indexing was quite slow. I used a VM with 2cpu + 6gb RAM and something similar for the Elasticsearch instance. None of them ran at max capacity on cpu or memory but the indexing took 80hrs (~300 items/sec). Not a big deal since it's done now but worth mentioning.

I think this is related to the low capacity of your VM, because we have at the same time storing in LMBD (with memory page, so nice to have RAM) and indexing with ES, which is also very RAM hungry. Even if the RAM does not look used, it is in reality because it's memory paging (RAM is used as much as available by LMDB).

I have a very good server and got everything processed for CrossRef in 2h 43m :)

-- Counters --------------------------------------------------------------------
crossref_failed_indexed_records
             count = 0
crossref_indexed_records
             count = 148831541
crossref_storing_rejected_records
             count = 8123459

-- Meters ----------------------------------------------------------------------
crossref_storing
             count = 148840337
         mean rate = 15194.88 events/second
     1-minute rate = 16574.74 events/second
     5-minute rate = 16992.97 events/second
    15-minute rate = 17176.07 events/second

BUILD SUCCESSFUL in 2h 43m 28s
3 actionable tasks: 1 executed, 2 up-to-date

real    163m28.910s
user    0m6.562s
sys 0m3.821s

Server has 32 CPU and 128GB RAM :) but I stored also the abstracts (everything running on the same server).

We can see also that we have 8796 CrossRef records stored, but not indexed (148840337-148831541), this is something I will investigate.

lfoppiano commented 1 month ago

Server has 32 CPU and 128GB RAM :) but I stored also the abstracts (everything running on the same server).

Did you have to change any special parameters? I increased the memory of elastic search to 64G and to the crossref task to 64G but I'm way lower than your values:

-- Counters --------------------------------------------------------------------
crossref_failed_indexed_records
             count = 0
crossref_indexed_records
             count = 1014324
crossref_storing_rejected_records
             count = 65676

-- Meters ----------------------------------------------------------------------
crossref_storing
             count = 1019324
         mean rate = 5671.59 events/second
     1-minute rate = 5549.65 events/second
     5-minute rate = 4115.43 events/second
    15-minute rate = 3361.42 events/second

I'm using the SSD (on AWS) and I've set up the fastest throughput for it

I'm not sure what may I do to increase the throughput 🤔

kermitt2 commented 1 month ago

I used a recent workstation, all SSD, and unchanged parameters. It's likely that AWS SSD and cpu are not comparable with bare metal.

For memory paging performance, half of the memory used for elasticsearch should be left to the system. So if 32GB memory for elasticsearch JVM, 32GB must be left to the system. I think it's the same for the glutton loading task and LMDB. Personally I used the default setting also for the memory and I think at least half of the memory was always available for the OS.

lfoppiano commented 1 month ago

Thanks @kermitt2 ! I have also some limitations on the number of CPUs, but it looks like even they have SSD, the performance is not satisfying. The SSD's performances also are decreasing with time. I did allocate 64Gb for elastic and 64Gb for the java job, but I will try to allocate 32Gb instead

karatekaneen commented 1 month ago

@lfoppiano What you're describing sounds like what I encountered. I had similar performance in the beginning with it dropping over time until I hit 300/s on average. I also used a VM with SSD but on GCP so maybe this only affects cloud instances and not bare metal for some reason?

lfoppiano commented 1 month ago

I don't remember exactly, but GCP seemed to me that was working faster, using the SSD. I forgot. Since I was using the free credits I had limitation to 250Gb maximum so, at the end I did not manage to load the full database 😢

I will try again, and if I find the solution I will post here.

lfoppiano commented 1 month ago

I'm testing a new instance with nitro hypervisor and I think I increased the throughput of the disks...

it's better but quite far from your performance, @kermitt2 @karatekaneen @kermitt2 can you make the same test on your machine? I'd be interested to compare

read test:
```
sudo hdparm -Ttv  /dev/nvme0n1
```

/dev/nvme0n1: readonly = 0 (off) readahead = 256 (on) geometry = 1024000/64/32, sectors = 2097152000, start = 0 Timing cached reads: 41542 MB in 1.99 seconds = 20912.98 MB/sec Timing buffered disk reads: 1058 MB in 3.00 seconds = 352.56 MB/sec


- write test:

dd if=/dev/zero of=/tmp/mnt/temp oflag=direct bs=128k count=16k dd: failed to open '/tmp/mnt/temp': No such file or directory ubuntu@ip-172-31-33-189:~$ dd if=/dev/zero of=/tmp/temp oflag=direct bs=128k count=16k 16384+0 records in 16384+0 records out 2147483648 bytes (2.1 GB, 2.0 GiB) copied, 22.6979 s, 94.6 MB/s


Update: it took 13h to load the full index, with performance dropping from 9k/second to 3k/second. quite a lot

8/21/24, 8:33:21 AM ============================================================

-- Counters -------------------------------------------------------------------- crossref_failed_indexed_records count = 0 crossref_indexed_records count = 149803402 crossref_storing_rejected_records count = 8185933

-- Meters ---------------------------------------------------------------------- crossref_storing count = 149805141 mean rate = 3033.63 events/second 1-minute rate = 1564.45 events/second 5-minute rate = 1564.89 events/second 15-minute rate = 1565.73 events/second

kermitt2 commented 1 month ago

read

lopez@trainer:~$ sudo hdparm -Ttv /dev/nvme0n1p2

/dev/nvme0n1p2:
 readonly      =  0 (off)
 readahead     = 256 (on)
 geometry      = 3814934/64/32, sectors = 7812984832, start = 1050624
 Timing cached reads:   49550 MB in  2.00 seconds = 24818.52 MB/sec
 Timing buffered disk reads: 5198 MB in  3.00 seconds = 1732.26 MB/sec

write

lopez@trainer:~$ dd if=/dev/zero of=/tmp/temp oflag=direct bs=128k count=16k
16384+0 records in
16384+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 0.921125 s, 2.3 GB/s

(PCIe 5)

kermitt2 / biblio-glutton

Combine synchronized storing and indexing, re-organize components, HAL support (no more DOI centric approach), dependency update, import format update #92