nytud / emtsv

e-magyar text processing system -- inter-module communication via tsv + REST API
GNU Lesser General Public License v3.0
27 stars 11 forks source link

Custom pipeline for REST server #5

Closed DavidNemeskey closed 5 years ago

DavidNemeskey commented 5 years ago

emtsv.py offers two modus operandi:

  1. as a standalone script
  2. as a REST server

The mode is chosen according to whether any tasks are specified on the command line or not. If at least one is specified, the script runs in standalone mode; otherwise, the REST server is started.

However, even with the rest server, we might not want to execute ALL tasks. It would be nice if we could get away with not loading all the models into memory, especially on low-memory systems.

dlazesz commented 5 years ago

Thank you for your feedback! As I see there is two separate issues:

Am I right?

I think the second one is already done, but not implemented in emtsv.py. Have you tried the 'lazy loading' feature? Maybe it is what you looking for.

I recommend using the REST server emtsv.py for only development and debuging purposes.

You can find an example to start a production-ready REST server in the docker folder. However, it does not feature the lazy loading by default. Would you prefer the lazy loading to be turned on by default?

DavidNemeskey commented 5 years ago

BTW I opened a pull request for this, have you had a look?

dlazesz commented 5 years ago

Yes, but I would prefer keeping emtsv.py as simple as possible. Flask's builtin server should only be used during development and testing when editing config.py accordingly is perfectly feasible. Other usecases are strongly discuraged, but possible through the Python API. I think lazy loading solves the problem dynamically in a more general way for all REST server cases.

Have you tried the lazy loading API instead? It is really simple. See Hunspell REST API for example. It loads each module on its first usage, so unused modules will not be loaded, thus keeping memory usage low.

If it does not fit for you, you can create your own main.py with custom behaviour using the Python API.

DavidNemeskey commented 5 years ago

I will check out lazy loading. However, it would be nice if the documentation described what lazy loading does in greater detail.

OK, if you want to keep emtsv.py simple, what about stripping the REST functionality out into (say) an emtsv_rest.py? Then users would be able to choose the script they want, they would always know what to expect, and we could add useful parameters to the latter (e.g. --port) without cluttering anything.

I think it is better if the repo provides the tools for the basic use-cases, as opposed to asking each user to write a new script for themselves. If you agree, I could quickly cook up the new script.

dlazesz commented 5 years ago

I admit that the documentation is in bad shape and it is lagging behind the new features. Sorry about that, as this might be one root cause of your issue. We are working on the developing of the documentation to be on par with the code. (We accept help on this.)

About your issue: Could you describe the usecases you mentioned in details? Currently we can not empathise with your requests, which is most likely due to the lack of documentation on the desired concrete usage examples.

DavidNemeskey commented 5 years ago

Here's one argument for the port number: since the REST server doesn't work with parallel requests (judging from the error message, it might be a bug in PurePos, I don't know), in order to run tasks parallelly, we could start multiple REST servers.

dlazesz commented 5 years ago

since the REST server doesn't work with parallel requests (judging from the error message, it might be a bug in PurePos, I don't know),

This worth filling a separate issue with a MWE.

Or just search around, if it is really a bug in emtsv or not: https://stackoverflow.com/questions/10938360/how-many-concurrent-requests-does-a-single-flask-process-receive/13929101#13929101

From the link:

When running the development server you get running app.run(), you get a single synchronous process, which means at most 1 requests being processed at a time.

This is one main reason why we do not recommend using REST mode in emtsv.py for production.

Have you tried it with a proper server instead of emtsv.py? Like using uwsgi config can be found in the docker folder. You might want to alter the line related to the number of processes and of course set the desired modules in the config. If the issue still presist with this configuration. Please feel free to fill a bugreport with an MWE in a separate issue.

makrai commented 5 years ago

I don't know much about the technical details of emtsv, just a general comment. @DavidNemeskey , while we appreciate the transparency of github conversations, sometimes it may be more efficient to discuss issues by phone (in Hungarian). So feel free to call Balázs, me, or Bálint (in this order)!

dlazesz commented 5 years ago