kermitt2 closed this 5 months ago
Would love to help to get this completed. What's left to do before this can be merged?
Hi @karatekaneen! I hope you're doing well. It's actually complete, except for what is in the todo list, but that would be too much for this PR. I was waiting for some feedback from a user, but it's fully functional according to my tests. I will try to merge it next weekend after quickly reviewing the documentation.
Lovely! I'll take it for a test spin as soon as I get the chance. Haven't worked with Solr before so might be a bit tricky to set up for me
The tests seem to be broken, which makes the gradlew clean build
command fail.
Running gradlew clean jar
instead, to skip the tests, seems to work and the server starts up. Waiting for the data to download, and then I'll try to get everything up and running.
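For reference, another way to get the full build while skipping tests is Gradle's `-x` flag, which excludes a task from the run (a sketch; task names assume the standard Java plugin used by the project):

```shell
# Build only the jar, as above (test task is not part of this path)
./gradlew clean jar

# Or run the full build pipeline while explicitly excluding the test task
./gradlew clean build -x test
```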
Here's some of the output from the failed step:
> Task :compileTestJava FAILED
/srv/glutton/src/test/java/com/scienceminer/lookup/reader/UnpaidWallReaderTest.java:3: error: package com.scienceminer.lookup.data does not exist
import com.scienceminer.lookup.data.UnpayWallMetadata;
^
/srv/glutton/src/test/java/com/scienceminer/lookup/reader/UnpaidWallReaderTest.java:14: error: cannot find symbol
UnpayWallReader target;
^
symbol: class UnpayWallReader
location: class UnpaidWallReaderTest
/srv/glutton/src/test/java/com/scienceminer/lookup/reader/PmidReaderTest.java:13: error: cannot find symbol
PmidReader target;
^
symbol: class PmidReader
location: class PmidReaderTest
/srv/glutton/src/test/java/com/scienceminer/lookup/reader/IstexIdsReaderTest.java:3: error: package com.scienceminer.lookup.data does not exist
import com.scienceminer.lookup.data.IstexData;
Sorry, I forgot to work on the tests! They have been updated.
@kermitt2 Tried it out and it works perfectly. Haven't tried the HAL stuff though since it's of no interest for us. Awesome work!
The only thing I've noticed is that the indexing was quite slow. I used a VM with 2 CPUs + 6GB RAM, and something similar for the Elasticsearch instance. Neither ran at max capacity on CPU or memory, but the indexing took 80 hours (~300 items/sec). Not a big deal since it's done now, but worth mentioning.
I think this is related to the low capacity of your VM, because we are at the same time storing in LMDB (which uses memory-mapped pages, so RAM is nice to have) and indexing with ES, which is also very RAM-hungry. Even if the RAM does not look used, it is in reality, because of memory paging: LMDB uses as much RAM as is available through the OS page cache.
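One way to see this on Linux is that LMDB's memory-mapped pages are accounted as page cache rather than as process memory, so they don't show up in the JVM's footprint. A quick sketch to check:

```shell
# The "buff/cache" column includes the OS page cache, where LMDB's
# memory-mapped pages live; "available" shows what can still be reclaimed.
free -h
```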
I have a very good server and got everything processed for CrossRef in 2h 43m :)
-- Counters --------------------------------------------------------------------
crossref_failed_indexed_records
count = 0
crossref_indexed_records
count = 148831541
crossref_storing_rejected_records
count = 8123459
-- Meters ----------------------------------------------------------------------
crossref_storing
count = 148840337
mean rate = 15194.88 events/second
1-minute rate = 16574.74 events/second
5-minute rate = 16992.97 events/second
15-minute rate = 17176.07 events/second
BUILD SUCCESSFUL in 2h 43m 28s
3 actionable tasks: 1 executed, 2 up-to-date
real 163m28.910s
user 0m6.562s
sys 0m3.821s
Server has 32 CPUs and 128GB RAM :) but I also stored the abstracts (everything running on the same server).
We can also see that we have 8796 CrossRef records stored but not indexed (148840337 - 148831541); this is something I will investigate.
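The gap can be checked directly from the two numbers in the report above:

```shell
# stored (crossref_storing meter) minus indexed (crossref_indexed_records)
# = records stored but never indexed
echo $((148840337 - 148831541))   # prints 8796
```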
Did you have to change any special parameters? I increased the memory of Elasticsearch to 64GB and of the CrossRef task to 64GB, but I'm way below your values:
-- Counters --------------------------------------------------------------------
crossref_failed_indexed_records
count = 0
crossref_indexed_records
count = 1014324
crossref_storing_rejected_records
count = 65676
-- Meters ----------------------------------------------------------------------
crossref_storing
count = 1019324
mean rate = 5671.59 events/second
1-minute rate = 5549.65 events/second
5-minute rate = 4115.43 events/second
15-minute rate = 3361.42 events/second
I'm using SSD (on AWS) and I've set up the fastest throughput for it.
I'm not sure what I can do to increase the throughput 🤔
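In case it helps, a common Elasticsearch tuning for one-off bulk loads is to disable refresh and replicas during indexing and restore them afterwards (a sketch; the index name `crossref` and the local endpoint are assumptions, not settings from this PR):

```shell
# Disable refresh and replicas for the duration of the bulk load
curl -X PUT "localhost:9200/crossref/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"refresh_interval": "-1", "number_of_replicas": 0}}'

# After the load completes, restore defaults
curl -X PUT "localhost:9200/crossref/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"refresh_interval": "1s", "number_of_replicas": 1}}'
```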
I used a recent workstation, all SSD, with unchanged parameters. It's likely that AWS SSDs and CPUs are not comparable with bare metal.
For memory-paging performance, half of the memory should be left to the system beyond what is given to Elasticsearch. So with 32GB for the Elasticsearch JVM, 32GB must be left to the system. I think it's the same for the glutton loading task and LMDB. Personally, I used the default memory settings, and I think at least half of the memory was always available to the OS.
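As a concrete sketch of that split on a 64GB machine (using the standard `ES_JAVA_OPTS` environment variable to override the Elasticsearch JVM heap; the 32GB figure is just the half-for-the-OS rule above, not a tested recommendation):

```shell
# 32GB heap for Elasticsearch, leaving ~32GB to the OS page cache
# (which LMDB's memory-mapped storage relies on)
export ES_JAVA_OPTS="-Xms32g -Xmx32g"
```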
Thanks @kermitt2! I also have some limitations on the number of CPUs, but it looks like even though they have SSDs, the performance is not satisfying. The SSDs' performance also decreases over time. I allocated 64GB for Elasticsearch and 64GB for the Java job, but I will try allocating 32GB instead.
@lfoppiano What you're describing sounds like what I encountered: similar performance in the beginning, dropping over time until I hit 300/s on average. I also used a VM with SSD, but on GCP, so maybe this only affects cloud instances and not bare metal for some reason?
I don't remember exactly, but GCP seemed to me to work faster using the SSD. Since I was using the free credits, I was limited to 250GB maximum, so in the end I did not manage to load the full database 😢
I will try again, and if I find the solution I will post here.
I'm testing a new instance with the Nitro hypervisor and I think I increased the throughput of the disks...
It's better, but quite far from your performance @kermitt2. @karatekaneen @kermitt2, can you run the same test on your machines? I'd be interested to compare.
sudo hdparm -Ttv /dev/nvme0n1
/dev/nvme0n1:
readonly = 0 (off)
readahead = 256 (on)
geometry = 1024000/64/32, sectors = 2097152000, start = 0
Timing cached reads: 41542 MB in 1.99 seconds = 20912.98 MB/sec
Timing buffered disk reads: 1058 MB in 3.00 seconds = 352.56 MB/sec
- write test:
dd if=/dev/zero of=/tmp/mnt/temp oflag=direct bs=128k count=16k
dd: failed to open '/tmp/mnt/temp': No such file or directory
ubuntu@ip-172-31-33-189:~$ dd if=/dev/zero of=/tmp/temp oflag=direct bs=128k count=16k
16384+0 records in
16384+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 22.6979 s, 94.6 MB/s
Update: it took 13h to load the full index, with performance dropping from 9k/second to 3k/second. That's quite a drop:
8/21/24, 8:33:21 AM ============================================================
-- Counters --------------------------------------------------------------------
crossref_failed_indexed_records
count = 0
crossref_indexed_records
count = 149803402
crossref_storing_rejected_records
count = 8185933
-- Meters ----------------------------------------------------------------------
crossref_storing
count = 149805141
mean rate = 3033.63 events/second
1-minute rate = 1564.45 events/second
5-minute rate = 1564.89 events/second
15-minute rate = 1565.73 events/second
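As a sanity check, the reported mean rate is consistent with the ~13h total, using only the numbers above:

```shell
# 149805141 records at a mean rate of 3033.63 records/second
awk 'BEGIN { printf "%.1f hours\n", 149805141 / 3033.63 / 3600 }'   # ≈ 13.7 hours
```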
lopez@trainer:~$ sudo hdparm -Ttv /dev/nvme0n1p2
/dev/nvme0n1p2:
readonly = 0 (off)
readahead = 256 (on)
geometry = 3814934/64/32, sectors = 7812984832, start = 1050624
Timing cached reads: 49550 MB in 2.00 seconds = 24818.52 MB/sec
Timing buffered disk reads: 5198 MB in 3.00 seconds = 1732.26 MB/sec
lopez@trainer:~$ dd if=/dev/zero of=/tmp/temp oflag=direct bs=128k count=16k
16384+0 records in
16384+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 0.921125 s, 2.3 GB/s
(PCIe 5)
This is a working PR for version 0.3, which introduces many changes:
Todo: