kermitt2 / biblio-glutton

A high performance bibliographic information service: https://biblio-glutton.readthedocs.io

Fixed ID Updated node #58

Closed lfoppiano closed 2 years ago

lfoppiano commented 3 years ago

.. continuation of #50 cc @karatekaneen

Additional:

Commands:

NOTE: I tried to preserve the various tabulations, but failed miserably. Please make sure to ignore the whitespace when comparing.

lfoppiano commented 3 years ago

OK. It works properly with the grenerlab file, compressed or not; however, the PR is not yet ready to be submitted.

When I specify the crossref dump it crashes with

Uncaught Error: EMFILE: too many open files, open '/Volumes/SEAGATE1TB/scienceminer/crossref_public_data_file_2021_01/32085.json.gz'

I think there is something wrong in the way I handle the files: the script iterates over all of them at once, while it should either process them serially or avoid opening too many files in parallel...
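To illustrate that suspicion, here is a minimal sketch (not the real script; `fakeIndexFile` is a hypothetical stand-in that only counts how many "files" are open at once, with no real I/O) contrasting fire-and-forget iteration with awaited, sequential iteration:

```javascript
// Counters shared by the fake per-file worker below.
let open = 0;
let maxOpen = 0;

function fakeIndexFile(file) {
    open++;
    maxOpen = Math.max(maxOpen, open);
    // simulate asynchronous I/O completing on a later tick
    return new Promise(resolve => setImmediate(() => { open--; resolve(); }));
}

// Fire-and-forget: every file is "opened" before any is closed,
// which is how EMFILE happens once real file descriptors are involved.
async function indexAllParallel(files) {
    await Promise.all(files.map(f => fakeIndexFile(f)));
}

// Sequential: at most one file is open at a time.
async function indexAllSequential(files) {
    for (const f of files) {
        await fakeIndexFile(f);
    }
}
```

With ten fake files, the parallel variant peaks at ten simultaneously open files, while the sequential one never exceeds one.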

Any clue?

kermitt2 commented 2 years ago

I think there is something wrong in the way I handle the files: the script iterates over all of them at once, while it should either process them serially or avoid opening too many files in parallel...

Yes, as it is, it's not sequential: all 40,230 files are opened at the same time. index() can return a Promise, and the call in the loop can use await so that the files are processed sequentially. In PR #61 I did something like this:


const fs = require('fs');
const path = require('path');

async function index(options) {
    if (options.dumpType === 'directory') {
        const files = fs.readdirSync(options.dump);
        for (const file of files) {
            // ...
            const file_path = path.join(options.dump, file);
            await indexFile(options, file_path);
        }
    } else {
        await indexFile(options, options.dump);
    }
}

function indexFile(options, dumpFile) {
    return new Promise((resolve, reject) => {
        // ...

        readStream.on("end", function () {
            // ...
            return resolve();
        });

        readStream.on("error", function (err) {
            // ...
            return reject(err);
        });
    });
}
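As a middle ground between fully sequential and fully parallel, the files could also be processed in bounded batches, so that only a few are open at once. This is a sketch, not part of the PR; `mapWithConcurrency` is a hypothetical helper:

```javascript
// Process items in batches of at most `limit`, awaiting each batch
// before starting the next, so at most `limit` workers run at a time.
async function mapWithConcurrency(items, limit, worker) {
    const results = [];
    for (let i = 0; i < items.length; i += limit) {
        const batch = items.slice(i, i + limit);
        results.push(...(await Promise.all(batch.map(worker))));
    }
    return results;
}

// Usage sketch against the loop above:
//   await mapWithConcurrency(files, 8,
//       file => indexFile(options, path.join(options.dump, file)));
```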
lfoppiano commented 2 years ago

I suppose this is already implemented in #66? If so, I would close both this PR and #58.

kermitt2 commented 2 years ago

The support for indexing the various dumps has been implemented in the new version, but not the changes of #50. So we could close this PR, but the other one, #50, is still on the road :)

karatekaneen commented 2 years ago

@lfoppiano @kermitt2 I'm so sorry, I didn't notice that I had stopped receiving notifications from GitHub, so I haven't seen any of this.

Anyway, I'm not sure I understand what needs to be done in #50 to resolve the issues you are experiencing?

kermitt2 commented 2 years ago

Hi @karatekaneen !

I worked on a new version of biblio-glutton in branch https://github.com/kermitt2/biblio-glutton/pull/66, which should fix all the import issues with the latest Crossref dump, add support for more formats and incremental updates, and more. I have also updated the JS indexing for all the new formats and migrated to a newer Elasticsearch. All these features should now be well covered and tested, so I think there is no remaining issue with the new branch.

However, regarding #50, I haven't moved to TypeScript yet; this is the only part of PR #50 not integrated for the moment.

karatekaneen commented 2 years ago

Oh, that's nice! Sounds like a lot of features that I'd love to try out.

Regarding TypeScript, I think it's only a "nice to have", since it mainly helps by surfacing errors at write time and is a good crutch while writing. But in this relatively small script it can also be a bit redundant.

lfoppiano commented 2 years ago

I'm closing this as it's old.