Closed lfoppiano closed 2 years ago
OK. It works properly with the greenelab file, compressed or not; however, the PR is not yet ready to be submitted.
When I specify the crossref dump it crashes with
Uncaught Error: EMFILE: too many open files, open '/Volumes/SEAGATE1TB/scienceminer/crossref_public_data_file_2021_01/32085.json.gz'
I think there is something wrong in the way I handle the files: the script is iterating over them, while it should either process them serially or avoid opening too many files in parallel...
Any clue?
I think there is something wrong in the way I handle the files: the script is iterating over them, while it should either process them serially or avoid opening too many files in parallel...
Yes, as it is, it's not sequential: all 40230 files are opened at the same time. index()
can return a Promise, and the call in the loop can use await so that the files are processed sequentially. In PR #61 I did something like this:
async function index(options) {
  if (options.dumpType === 'directory') {
    // iterate over the dump directory and index each file one by one
    const files = fs.readdirSync(options.dump);
    for (const file of files) {
      ...
      const file_path = path.join(options.dump, file);
      // await forces the loop to wait for each file before opening the next
      await indexFile(options, file_path);
    }
  } else {
    await indexFile(options, options.dump);
  }
}
function indexFile(options, dumpFile) {
  // wrap the stream-based indexing in a Promise so callers can await it
  return new Promise((resolve, reject) => {
    ...
    readStream.on("end", function () {
      ...
      return resolve();
    });
    readStream.on("error", function (err) {
      ...
      return reject(err);
    });
  });
}
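The fully sequential loop above fixes the EMFILE error but processes only one file at a time. The other option mentioned, not opening too many files in parallel, could be sketched roughly like this; `indexAllBounded`, the fake single-argument `indexFile` signature, and the batch size are illustrative names and assumptions, not code from the actual PR:

```javascript
// Minimal sketch of bounded parallelism: process the file list in chunks
// of `limit`, awaiting each chunk before starting the next, so at most
// `limit` files are ever open simultaneously.
async function indexAllBounded(files, indexFile, limit = 8) {
  const results = [];
  for (let i = 0; i < files.length; i += limit) {
    const chunk = files.slice(i, i + limit);
    // all files in the chunk are indexed concurrently, then we wait
    results.push(...await Promise.all(chunk.map((f) => indexFile(f))));
  }
  return results;
}
```

This keeps some of the throughput of parallel processing while staying well under the open-file-descriptor limit that triggers EMFILE.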
I suppose this is already implemented in #66? If so I would close both this PR and #58
The support for indexing the various dumps has been implemented in the new version, but not the changes from #50. So we could close this PR, but the other one, #50, is still pending :)
@lfoppiano @kermitt2 I'm so sorry, I didn't notice that I stopped receiving notifications from github so haven't seen any of this.
Anyway, I'm not sure I understand what needs to be done in #50 to resolve the issues you are experiencing.
Hi @karatekaneen !
I worked on a new version of biblio-glutton on branch https://github.com/kermitt2/biblio-glutton/pull/66, which should fix all the import issues with the latest Crossref, add support for more formats and incremental updates, and more. I have also updated the JS indexing for all the new formats and migrated to a newer ElasticSearch. All these features should now be well covered and tested, so I don't think any issue remains with the new branch.
However, regarding #50, I haven't moved to TypeScript yet; this is the only part of PR #50 not integrated at the moment.
Oh, that's nice! Sounds like a lot of features that I'd love to try out.
Regarding TypeScript, I think it's only a "nice to have", since it mainly helps by surfacing errors at write time and is a good crutch while writing. But in this relatively small script it can also be a bit redundant.
I'm closing this as it's old.
.. continuation of #50 cc @karatekaneen
Additional:
Commands:
Run indexing process
tsc && node dist/main index -dump /Volumes/SEAGATE1TB/scienceminer/crossref_public_data_file_2021_01
Check the communication with the index
tsc && node dist/main health
NOTE: I tried to preserve the original indentation, but failed miserably. Please make sure to ignore the whitespace when comparing.