hipster-philology / pyrrha

A language-independent post-correction app for POS-tagging and lemmatization
https://pyrrha.huma-num.fr
MIT License
27 stars 16 forks source link

Check how the application works with postgresql #267

Open PonteIneptique opened 2 years ago

PonteIneptique commented 2 years ago

Is your feature request related to a problem? Please describe.

The current app shows limits with SQLite: it's slow, and does not benefit from a fast SQL engine when having different users doing multiple writes.

Describe the solution you'd like

FrFerry commented 2 years ago

Here is what to do to load a sqlite dump in a postgresql database:

DELETE FROM word_token WHERE corpus IN (SELECT DISTINCT wt.corpus FROM word_token as wt LEFT OUTER JOIN corpus as c ON c.id = wt.corpus WHERE c.id is NULL);

DELETE FROM token_history WHERE word_token_id IN (SELECT DISTINCT th.word_token_id FROM token_history as th LEFT OUTER JOIN word_token as wt ON wt.id = th.corpus WHERE wt.id is NULL);

DELETE FROM token_history WHERE corpus IN (SELECT DISTINCT th.corpus FROM token_history as th LEFT OUTER JOIN corpus as c ON c.id = th.corpus WHERE c.id is NULL);

DELETE FROM bookmark WHERE corpus_id IN (SELECT DISTINCT b.corpus_id FROM bookmark as b LEFT OUTER JOIN corpus as c ON c.id = b.corpus_id WHERE c.id is NULL);

DELETE FROM bookmark WHERE token_id IN (SELECT DISTINCT b.token_id FROM bookmark as b LEFT OUTER JOIN word_token as wt ON wt.id = b.token_id WHERE wt.id is NULL);

DELETE FROM favorite WHERE corpus_id IN (SELECT DISTINCT f.corpus_id FROM favorite as f LEFT OUTER JOIN corpus as c ON c.id = f.corpus_id WHERE c.id is NULL);

DELETE FROM corpus_custom_dictionary WHERE corpus IN (SELECT DISTINCT ccd.corpus FROM corpus_custom_dictionary as ccd LEFT OUTER JOIN corpus as c ON c.id = ccd.corpus WHERE c.id is NULL);

(I added .txt in file name only because github asked it, it's only a file with a command in it)

MrGecko commented 2 years ago

@FrFerry I have only one error during the pgloader moment:

2022-03-28T13:09:13.673334+02:00 ERROR Database error 22001: valeur trop longue pour le type character varying(64)
CONTEXT: COPY corpus, ligne 401, colonne name : « Extrait des Lunettes de Princes (ca 1461-1465) de Jean Meschinot, d’après le manuscript BnF, fr. ... »

Result: the corpus table is empty.

Yet I can create fresh new corpora and manage them as will, so nothing apparently broke atm.

FrFerry commented 2 years ago

@FrFerry I have only one error during the pgloader moment:

2022-03-28T13:09:13.673334+02:00 ERROR Database error 22001: valeur trop longue pour le type character varying(64)
CONTEXT: COPY corpus, ligne 401, colonne name : « Extrait des Lunettes de Princes (ca 1461-1465) de Jean Meschinot, d’après le manuscript BnF, fr. ... »

Result: the corpus table is empty.

Yet I can create fresh new corpora and manage them as will, so nothing apparently broke atm.

Well ... indeed we have more than 64 characters. These kind of restriction wasn't check by sqlite ?

In my opinion we have 2 solutions here:

What do you think ?

PonteIneptique commented 2 years ago

We should remove the limit, or set it to a higher count, such as 254

FrFerry commented 2 years ago

We should remove the limit, or set it to a higher count, such as 254

I pushed the modification on the pull request.

MrGecko commented 2 years ago

Even though it now seems to be working, I would not merge the PR before the tests are fixed (too many failures related to sql constraints) or investigated. Might be discussed/worked in an other issue though.

carinedengler commented 2 years ago

Hi, I'm taking over for @FrFerry for this ticket.

@FrFerry told me that the tests are mainly broken because of constraints that SQLite did not check but PostgreSQL does, but that these are not related to the behavior of the application. He proposed to not migrate the tests to PostgreSQL and instead to continue to use SQLite for the tests. What is your position on this?

carinedengler commented 2 years ago

I also opened a new PR with @FrFerry 's commits and my own changes to continue working on this ticket, could you close https://github.com/hipster-philology/pyrrha/pull/274 to avoid confusion?

carinedengler commented 2 years ago

I am currently trying to install the application using PostgreSQL - however, I cannot find the information on which version we have agreed on (if any) ? I'd go for PostgreSQL 14 unless you have some restrictions on your side ?

PonteIneptique commented 2 years ago

Please go for it. AFAIK, we should have no issues with PSQL 14 and I am not sure that there would be huge breaking change with 13 :)

MrGecko commented 2 years ago

I've done my tests with the v13.6 so far and its ok but please, proceed with v14 :)

carinedengler commented 2 years ago

OK, thanks for your reply!

carinedengler commented 2 years ago

Just to clarify this point, when I run the tests on https://github.com/hipster-philology/pyrrha/pull/280 they fail

when you say the tests need to be fixed, do you mean the tests running using SQLite? Do you want us to port the tests as well to PostgreSQL?

PonteIneptique commented 2 years ago

Hi @carinedengler, you should really check the current version of the dev branch, as it includes fixes for the current sqlite-only version of the application. Some tests were known for having a chance of failing (80% of the time), this seems to have fixed it. This also raises requirements version for security reasons.

carinedengler commented 2 years ago

I checked the dev branch out this morning and rebased the postgresql branch on it, the 3 failing tests (using SQLite) I have are related to the changes made for the PostgreSQL port. The question is therefore whether you would want me to fix the tests so they can be run on SQLite after the port to PostgreSQL or whether you would want me to port the SQLite tests to PostgreSQL ?

PonteIneptique commented 2 years ago

The change I am talking about seems to have been merged after your last commits ( https://github.com/hipster-philology/pyrrha/pull/281 ), are you sure you are up to date ? I pinged you on the PR with the current PR fixing those tests.

carinedengler commented 2 years ago

I must have fetched/rebased before you committed your latest changes, I should be up to date now

should I correct the tests for SQLite or for PostgreSQL ? or both ?

PonteIneptique commented 2 years ago

should I correct the tests for SQLite or for PostgreSQL ? or both ?

The application should run on both SQLite and PostgreSQL, as it is still used locally by some researchers.

CI Tests should also actually have a version for SQLite and one for PostgreSQL

carinedengler commented 2 years ago

OK, thanks for the clarification

carinedengler commented 2 years ago

when creating a new development instance with python manage.py run what should be the default DBMS used ?

PonteIneptique commented 2 years ago

I think the default in dev mode should be SQLite, but we could have a new dev settings as Dev-Postgresql to facilitate work with this one

carinedengler commented 2 years ago

I think the default in dev mode should be SQLite, but we could have a new dev settings as Dev-Postgresql to facilitate work with this one

Could you give an example how you would use that? Since db.engine.dialect.name can be used to retrieve the DBMS currently in use, and DEV_DATABASE_URL allows for passing both SQLite and PostgreSQL accesses, it seems to me that introducing another setting is not necessary.

PonteIneptique commented 2 years ago

I think I would simply have a new config object in the config.py file that could be used from with the manage CLI

PonteIneptique commented 2 years ago

So I'd go with the information retrieval you talk about and the DATABASE URI parameter, yes :)

carinedengler commented 2 years ago

OK, I get your idea now :+1:

carinedengler commented 2 years ago

the tests should now be fixed for SQLite - I won't have time before the end of the week (maybe even start next week) to have a look at the PostgreSQL tests, but introducing a switch for the tests for SQLite or PostgreSQL and fixing the PostgreSQL tests will be the next items on the list