paracrawl / corset

Corset is a web-based data selection portal that helps you get relevant data from massive amounts of parallel data.
https://corset.paracrawl.eu
GNU General Public License v3.0

Cannot write to /var/solr as 8983:8983 - dp-solr log #13

Open jonwolds opened 1 year ago

jonwolds commented 1 year ago

Hi,

I created the docker containers using docker-compose.yaml and got many of the same issues as stated in the now closed issue. There are errors in the dp-front and dp-back logs as well (gunicorn-error.log isn't writable, redis.log -> can't open the log file: No such file or directory). The dp-solr container doesn't actually seem to start up though.

Any help much appreciated.

Jon

mbanon commented 1 year ago

Hi @jonwolds , the redis issue usually happens when the path where the redis config file is created does not exist inside the docker container. It's fixed by creating this path by hand.

jonwolds commented 1 year ago

Many thanks for that. I'm still getting errors in dp-front and dp-back: "gunicorn.errors.HaltServer: <HaltServer 'Worker failed to boot.' 3>".

Also, the Cannot write to /var/solr as 8983:8983 persists. I could edit the permissions of /var/solr by exec-ing in, but the container stops as soon as it starts, so I'd have to create a new container, I suppose, but then I'm not sure about how to link that back up with docker-compose. Any ideas?

jonwolds commented 1 year ago

Answering my own question, this worked: (sudo) chown 8983:8983 /mnt/solr-data/solr
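
The ownership fix above can be checked before starting the container. The following is a minimal sketch, assuming the Solr data directory is a bind mount on the host and that the container runs as the uid:gid named in the issue title (8983:8983). The helper names `owned_by` and `describe_owner` are hypothetical and not part of Corset.

```python
import os

# uid:gid taken from the error message in the issue title
SOLR_UID = 8983
SOLR_GID = 8983


def owned_by(path, uid, gid):
    """Return True if `path` is owned by exactly uid:gid."""
    st = os.stat(path)
    return st.st_uid == uid and st.st_gid == gid


def describe_owner(path, uid=SOLR_UID, gid=SOLR_GID):
    """Build a one-line report, suggesting chown when ownership differs."""
    if owned_by(path, uid, gid):
        return f"{path}: ok, owned by {uid}:{gid}"
    st = os.stat(path)
    return (
        f"{path}: owned by {st.st_uid}:{st.st_gid}, expected {uid}:{gid} "
        f"(try: sudo chown {uid}:{gid} {path})"
    )
```

For example, `print(describe_owner("/mnt/solr-data/solr"))` on the host reports whether the chown is still needed for the path used in this thread.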

I hadn’t realised what the dp-solr container was when I first asked the question.

mbanon commented 1 year ago

Awesome! Is everything working for you now?

jonwolds commented 1 year ago

localhost:5000 page fires up fine, but I guess I need to do a lot of configuration work.

Basically, I just want to load up a couple of the manufactured corpora from paracrawl.eu for my own personal use, so I’ve no need for the authentication system, and I don’t think I’m able to install the Google app because I don’t have access to Google Workspace.

Any tips on the best order to do things in would be very welcome.

Many thanks in advance!

jonwolds commented 1 year ago

I'm still struggling to get the setup working. The login process appears to work fine, but I then get sent to localhost:5000/search, where I get a 500 - Internal Server Error. The log produced (from dp-front) is below.

[2023-01-26 19:25:53 +0000] [15] [INFO] Booting worker with pid: 15
[2023-01-26 19:26:11,728] ERROR in app: Exception on /search/ [GET]
Traceback (most recent call last):
  File "/opt/dp/front/venv/lib/python3.9/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/opt/dp/front/venv/lib/python3.9/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/opt/dp/front/venv/lib/python3.9/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/opt/dp/front/venv/lib/python3.9/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/opt/dp/front/venv/lib/python3.9/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/dp/front/venv/lib/python3.9/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/opt/dp/front/venv/lib/python3.9/site-packages/flask_login/utils.py", line 272, in decorated_view
    return func(*args, **kwargs)
  File "/opt/dp/front/app/blueprints/search/views.py", line 54, in search_view
    corpus_collection = base_corpus.solr_collection
AttributeError: 'NoneType' object has no attribute 'solr_collection'

I guess that's related to the setup of the Solr collection, which is where I'm struggling to follow the deployment instructions. I copied solr.xml to the directory referenced by docker-compose.yaml and changed the permissions, but the dp-solr log still says:

2023-01-26 19:25:56.480 INFO (main) [] o.a.s.s.CoreContainerProvider Solr Home: /var/solr/data (source: system property: solr.solr.home)
2023-01-26 19:25:56.483 INFO (main) [] o.a.s.c.SolrXmlConfig solr.xml not found in SOLR_HOME, using built-in default

I put a core.properties file there, too, but it references a schema.xml and a solrconfig.xml, which I do not know how to set up (no instructions in the deployment guide).
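
A quick way to see which of these files a core directory is missing is sketched below. It assumes the conventional standalone Solr core layout (core.properties at the top of the core directory, solrconfig.xml and a schema file under conf/). The function name `missing_core_files` is hypothetical, and the exact layout Corset expects is not confirmed in this thread.

```python
from pathlib import Path


def missing_core_files(core_dir):
    """List the conventional core files that are absent under `core_dir`."""
    core = Path(core_dir)
    missing = []
    if not (core / "core.properties").is_file():
        missing.append("core.properties")
    if not (core / "conf" / "solrconfig.xml").is_file():
        missing.append("conf/solrconfig.xml")
    # Either a hand-written schema.xml or a managed schema file counts.
    schema_names = ("schema.xml", "managed-schema", "managed-schema.xml")
    if not any((core / "conf" / name).is_file() for name in schema_names):
        missing.append("conf/schema.xml or managed-schema")
    return missing
```

Run inside the dp-solr container against a core directory under the Solr home reported in the log (/var/solr/data), an empty list means all three pieces are in place.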

Also, the deployment section (I may be jumping the gun here) says "Go to the web interface of your Solr instance", but localhost:5000 is the only port open, so I don't really understand what this means.

Any ideas? Many thanks in advance,

Jon

mbanon commented 1 year ago

Hi again Jon! I've been taking a look into the Dockerfiles and, according to https://github.com/paracrawl/corset/blob/master/docker-compose.yaml#L53 , I think the Solr web interface should be reachable at localhost:8090 (or maybe localhost:8090/solr).

Regarding the missing schema.xml, it's in the root folder of the repository (and also here). As for solrconfig.xml, I am not 100% confident, but I think it's self-generated by Solr.

jonwolds commented 1 year ago

Thanks for following up, Marta. It's much appreciated!

Here's my progress so far (I won't be working on this for the next week).

I got into the solr web interface by adding
ports:

In the end, I created the new core using: ./solr create -c name-of-your-new-core

I had to exec into the dp-solr container to do this, as my efforts via the web interface did not succeed in creating a solrconfig.xml file. This new core then shows up correctly in the Solr web interface. It probably needs adjusting using the schema.xml file from the corset directory, too. Changing the permissions (chown 8983:8983) is always necessary as well.

I'm still getting an internal server error at localhost:5000/search, but hopefully once I load some data into the core I've created things might improve.

jonwolds commented 1 year ago

This is the error message I'm getting from dp-front

[2023-02-05 11:40:42,481] ERROR in app: Exception on /search/ [GET]
Traceback (most recent call last):
  File "/opt/dp/front/venv/lib/python3.9/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/opt/dp/front/venv/lib/python3.9/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/opt/dp/front/venv/lib/python3.9/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/opt/dp/front/venv/lib/python3.9/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/opt/dp/front/venv/lib/python3.9/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/dp/front/venv/lib/python3.9/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/opt/dp/front/venv/lib/python3.9/site-packages/flask_login/utils.py", line 272, in decorated_view
    return func(*args, **kwargs)
  File "/opt/dp/front/app/blueprints/search/views.py", line 54, in search_view
    corpus_collection = base_corpus.solr_collection
AttributeError: 'NoneType' object has no attribute 'solr_collection'

I've tried to upload some data using tmxutils, but that hasn't been successful yet

mbanon commented 1 year ago

Hi again! The error suggests that no corpora are registered in the DB (which makes sense because you are having trouble with that :)). What error are you getting when trying to upload data?
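
The failure mode in the traceback can be reproduced in isolation. This is a minimal sketch, not Corset's actual code: `BaseCorpus`, `first_corpus` and `collection_for` are hypothetical stand-ins for a database lookup that returns None when no corpus is registered.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BaseCorpus:
    name: str
    solr_collection: str


def first_corpus(rows: List[BaseCorpus]) -> Optional[BaseCorpus]:
    """Stand-in for a DB lookup: None when nothing is registered."""
    return rows[0] if rows else None


def collection_for(rows: List[BaseCorpus]) -> str:
    """Return the Solr collection name, failing with a clear message."""
    corpus = first_corpus(rows)
    if corpus is None:
        # Without this guard, corpus.solr_collection raises the
        # AttributeError on 'NoneType' seen in the log.
        raise LookupError("no base corpus registered in the database")
    return corpus.solr_collection
```

With an empty list, `collection_for([])` raises LookupError instead of the less informative AttributeError.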

jonwolds commented 1 year ago

Hi again, I've been having a look at this again, and I've now managed to upload data into solr, but I still can't manage to sort out the link between solr and dp-front.

My configuration is most likely wrong, but the information provided is not quite enough to get it working.

The specific error I'm getting in the gunicorn-error.log is:

[2023-03-09 18:37:23 +0000] [13] [INFO] Booting worker with pid: 13
[2023-03-09 18:37:37,216] ERROR in app: Exception on /search/ [GET]
Traceback (most recent call last):
  File "/opt/dp/front/venv/lib/python3.9/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/opt/dp/front/venv/lib/python3.9/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/opt/dp/front/venv/lib/python3.9/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/opt/dp/front/venv/lib/python3.9/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/opt/dp/front/venv/lib/python3.9/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/dp/front/venv/lib/python3.9/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/opt/dp/front/venv/lib/python3.9/site-packages/flask_login/utils.py", line 272, in decorated_view
    return func(*args, **kwargs)
  File "/opt/dp/front/app/blueprints/search/views.py", line 54, in search_view
    corpus_collection = base_corpus.solr_collection
AttributeError: 'NoneType' object has no attribute 'solr_collection'

http://localhost:5000/search/ produces a 500 Internal Server Error.

Any ideas?

Cheers, Jon

jonwolds commented 1 year ago

I got to the next stage and finally managed to get the /search/ page to appear properly. I used an INSERT SQL command tailored to the Solr collection I had created, based on the model in the greyed-out part of the dpdb_initdb.sql file.

Unfortunately, the search function still doesn't find anything. I'm guessing there's more configuration to do with the dpdb tables in postgres.

jonwolds commented 1 year ago

This is the error message in the gunicorn-error.log (dp-front)

[2023-03-12 17:34:53,188] ERROR in app: Exception on /query/ [GET]
Traceback (most recent call last):
  File "/opt/dp/front/venv/lib/python3.9/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/opt/dp/front/venv/lib/python3.9/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/opt/dp/front/venv/lib/python3.9/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/opt/dp/front/venv/lib/python3.9/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/opt/dp/front/venv/lib/python3.9/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/dp/front/venv/lib/python3.9/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/opt/dp/front/venv/lib/python3.9/site-packages/flask_login/utils.py", line 272, in decorated_view
    return func(*args, **kwargs)
  File "/opt/dp/front/app/blueprints/query/views.py", line 32, in query_view
    base_corpus = base_corpus_bo.get_base_corpora_by_pair(source_lang.code, target_langs[0].code)[0]
IndexError: list index out of range

mbanon commented 1 year ago

Hi Jon, all your errors seem related to the fact that "get_base_corpora_by_pair" is not returning anything. This is probably caused by the DB being empty (or not properly filled by the INSERT you made by hand), or the connection between the front, the back and the DB does not work.

Some hints:

jonwolds commented 1 year ago

Hi again Marta,

This is what I have in the basecorpora table:

"id"    "name"  "description"   "source_lang"   "target_lang"   "sentences" "size_mb"   "solr_collection"   "is_active" "is_highlight"
1   "TMXcore FR-EN" "French English tmx"    12  1   22093   20  "tmxcore"   true    true

Can you see anything obviously wrong? tmxcore is the name of the Solr core.
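
The row above can be tried out in a throwaway database. The sketch below uses an in-memory SQLite database as a stand-in for Postgres. The column names and values come from the table above, while the column types and the helper names `make_demo_db` and `collections_for_pair` are assumptions made for illustration. It shows that a lookup keyed on (source_lang, target_lang) only matches in the direction in which the row was inserted.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory table shaped like basecorpora (types are guessed)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE basecorpora (
               id INTEGER PRIMARY KEY,
               name TEXT, description TEXT,
               source_lang INTEGER, target_lang INTEGER,
               sentences INTEGER, size_mb INTEGER,
               solr_collection TEXT,
               is_active BOOLEAN, is_highlight BOOLEAN)"""
    )
    conn.execute(
        """INSERT INTO basecorpora
               (name, description, source_lang, target_lang, sentences,
                size_mb, solr_collection, is_active, is_highlight)
           VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)""",
        ("TMXcore FR-EN", "French English tmx", 12, 1, 22093, 20,
         "tmxcore", True, True),
    )
    return conn


def collections_for_pair(conn, source_lang, target_lang):
    """Return the Solr collections registered for one language direction."""
    rows = conn.execute(
        "SELECT solr_collection FROM basecorpora "
        "WHERE source_lang = ? AND target_lang = ? AND is_active",
        (source_lang, target_lang),
    ).fetchall()
    return [row[0] for row in rows]
```

In this sketch, `collections_for_pair(conn, 12, 1)` finds the row, while the reversed pair returns an empty list, and indexing an empty list with `[0]` raises an IndexError.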

Thanks again for your help!

Jon

jonwolds commented 1 year ago

I can see that the search terms (e.g. charter here) are making it from dp-front to dp-solr, but no hits are displayed. This is the log from dp-solr:

2023-03-14 19:52:12.238 INFO  (qtp1622458036-22) [ x:tmxcore] o.a.s.c.S.Request webapp=/solr path=/select params={q=trg:"charter"&hl=true&start=0&hl.fragsize=0&hl.fl=trg&sort=custom_score+desc&rows=50&wt=json} hits=38 status=0 QTime=11
2023-03-14 19:52:38.274 INFO  (qtp1622458036-25) [ x:tmxcore] o.a.s.c.S.Request webapp=/solr path=/select params={q=src:"charter"&hl=true&start=0&hl.fragsize=0&hl.fl=src&sort=custom_score+desc&rows=50&wt=json} hits=1 status=0 QTime=1
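
The `hits=` field in these lines shows how many documents Solr matched for each request. A small sketch for pulling the query and hit count out of such a line follows. The regular expression is written against the two lines shown here and is an assumption about the log format in general, and `parse_request_line` is a hypothetical helper.

```python
import re
from urllib.parse import parse_qs

# Matches the tail of a Solr request log line as it appears in this thread.
REQUEST_RE = re.compile(
    r"path=(?P<path>\S+) params=\{(?P<params>.*)\} "
    r"hits=(?P<hits>\d+) status=(?P<status>\d+) QTime=(?P<qtime>\d+)"
)


def parse_request_line(line):
    """Extract path, query, sort order and hit count, or None if no match."""
    match = REQUEST_RE.search(line)
    if match is None:
        return None
    params = parse_qs(match.group("params"))
    return {
        "path": match.group("path"),
        "query": params.get("q", [""])[0],
        "sort": params.get("sort", [""])[0],
        "hits": int(match.group("hits")),
        "status": int(match.group("status")),
    }
```

A non-zero `hits` with `status` 0 indicates that Solr itself answered the query, so results that never reach the page are being lost somewhere after Solr.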

jonwolds commented 1 year ago

OK, I think by reversing the order of the languages, so English is first (source) and French the second (target), that initial error is averted. However, the following error is now showing up in the gunicorn-api-error.log in dp-back:

[2023-03-15 21:00:55,678] ERROR in app: Exception on /search [GET]
Traceback (most recent call last):
  File "/opt/dp/back/venv/lib/python3.7/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/dp/back/venv/lib/python3.7/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/opt/dp/back/venv/lib/python3.7/site-packages/flask_restx/api.py", line 375, in wrapper
    resp = resource(*args, **kwargs)
  File "/opt/dp/back/venv/lib/python3.7/site-packages/flask/views.py", line 89, in view
    return self.dispatch_request(*args, **kwargs)
  File "/opt/dp/back/venv/lib/python3.7/site-packages/flask_restx/resource.py", line 44, in dispatch_request
    resp = meth(*args, **kwargs)
  File "/opt/dp/back/venv/lib/python3.7/site-packages/flask_login/utils.py", line 272, in decorated_view
    return func(*args, **kwargs)
  File "/opt/dp/back/api/resources/search.py", line 55, in get
    return SearchResponse.schema().dump(search_response), 200
  File "/opt/dp/back/venv/lib/python3.7/site-packages/dataclasses_json/mm.py", line 343, in dump
    dumped = Schema.dump(self, obj, many=many)
  File "/opt/dp/back/venv/lib/python3.7/site-packages/marshmallow/schema.py", line 558, in dump
    result = self._serialize(processed_obj, many=many)
  File "/opt/dp/back/venv/lib/python3.7/site-packages/marshmallow/schema.py", line 523, in _serialize
    value = field_obj.serialize(attr_name, obj, accessor=self.get_attribute)
  File "/opt/dp/back/venv/lib/python3.7/site-packages/marshmallow/fields.py", line 328, in serialize
    return self._serialize(value, attr, obj, **kwargs)
  File "/opt/dp/back/venv/lib/python3.7/site-packages/marshmallow/fields.py", line 716, in _serialize
    return [self.inner._serialize(each, attr, obj, **kwargs) for each in value]
  File "/opt/dp/back/venv/lib/python3.7/site-packages/marshmallow/fields.py", line 716, in <listcomp>
    return [self.inner._serialize(each, attr, obj, **kwargs) for each in value]
  File "/opt/dp/back/venv/lib/python3.7/site-packages/marshmallow/fields.py", line 583, in _serialize
    return schema.dump(nested_obj, many=many)
  File "/opt/dp/back/venv/lib/python3.7/site-packages/dataclasses_json/mm.py", line 343, in dump
    dumped = Schema.dump(self, obj, many=many)
  File "/opt/dp/back/venv/lib/python3.7/site-packages/marshmallow/schema.py", line 558, in dump
    result = self._serialize(processed_obj, many=many)
  File "/opt/dp/back/venv/lib/python3.7/site-packages/marshmallow/schema.py", line 523, in _serialize
    value = field_obj.serialize(attr_name, obj, accessor=self.get_attribute)
  File "/opt/dp/back/venv/lib/python3.7/site-packages/marshmallow/fields.py", line 328, in serialize
    return self._serialize(value, attr, obj, **kwargs)
  File "/opt/dp/back/venv/lib/python3.7/site-packages/marshmallow/fields.py", line 916, in _serialize
    ret = self._format_num(value)  # type: _T
  File "/opt/dp/back/venv/lib/python3.7/site-packages/marshmallow/fields.py", line 891, in _format_num
    return self.num_type(value)
TypeError: float() argument must be a string or a number, not 'list'

mbanon commented 1 year ago

Hi! Yes, I think that having English first is mandatory.

As for the last error, I have never seen it before. I see that "flask_login" is mentioned in the traceback. As mentioned above, you were not using Google login. How are you managing authorization and users? Login is needed in search requests (https://github.com/paracrawl/corset/blob/master/back/api/resources/search.py#L18)

jonwolds commented 1 year ago

Yes, it's a weird error.

I don't think it's linked to login, because that is now working perfectly. My earlier comment was incorrect!
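
The last line of the dp-back traceback shows `float()` being handed a list while a numeric field is serialized. One possible cause, which is an assumption and not confirmed anywhere in this thread, is a Solr field defined as multi-valued, since Solr returns multi-valued fields as JSON arrays. The sketch below reproduces the error and shows a hypothetical `to_float` helper (not part of Corset) that unwraps single-element lists.

```python
def to_float(value):
    """Coerce a field value to float, unwrapping a one-element list."""
    if isinstance(value, (list, tuple)):
        if len(value) != 1:
            raise ValueError(f"expected a single value, got {len(value)}")
        value = value[0]
    return float(value)


def reproduces_type_error(value):
    """Return True if a plain float() call rejects `value` with TypeError."""
    try:
        float(value)
    except TypeError:
        return True
    return False
```

If the field really is multi-valued, applying the repository's schema.xml to the core so that the field is single-valued would address the cause, whereas a helper like this only works around it.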