nytud / emtsv

e-magyar text processing system -- inter-module communication via tsv + REST API
GNU Lesser General Public License v3.0

Better documentation of module relations, data requirements, output formats #19

Open beepsoft opened 3 years ago

beepsoft commented 3 years ago

emtsv is a really great tool, thanks for your work!

I'm new to NLP, so that may be the reason for all my problems, but from the documentation alone it is rather difficult to work effectively with emtsv.

One main thing I miss from the documentation is what each module's input and output is:

https://github.com/dlt-rilmta/emtsv#modules

For example, if I want to use the chunk module I don't know what data it needs so that it can run.

Starting naively like this:

echo "Ez jó lenne, ha működne!" | python3 main.py chunk

... I get this error:

xtsv.pipeline.ModuleError: ERROR: 'Tagger' module requires {'form', 'xpostag'} fields but the previous module 'Input Text' has only {'Ez jó lenne, ha működne.'} fields!

That's fine, but which module will generate 'form' and 'xpostag'? After some trial and error I figured out that I need tok,morph,pos,chunk, but this is a tedious way to find it out.
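To illustrate what a documented input/output table would make possible, here is a minimal Python sketch that derives the module chain automatically. The `MODULES` table and the `resolve` helper are hypothetical and not part of emtsv; only the chunker's requirement ('form', 'xpostag') is taken from the error message above, and the other field names are assumptions.

```python
# Hypothetical table: module name -> (required fields, produced fields).
# Only the chunk requirement comes from the error message in this issue;
# the remaining entries are assumptions made for illustration.
MODULES = {
    "tok": (set(), {"form"}),
    "morph": ({"form"}, {"anas"}),
    "pos": ({"form", "anas"}, {"lemma", "xpostag"}),
    "chunk": ({"form", "xpostag"}, {"NP-BIO"}),
}


def resolve(target, modules=MODULES):
    """Return the modules to run, in order, so that `target` gets its fields.

    Assumes the table has no circular requirements.
    """
    order = []

    def visit(name):
        if name in order:
            return
        required, _ = modules[name]
        for field in sorted(required):
            # Find the modules that produce this field; take the first one.
            providers = [m for m, (_, produced) in modules.items()
                         if field in produced]
            if not providers:
                raise ValueError(f"no module produces {field!r}")
            visit(providers[0])
        order.append(name)

    visit(target)
    return order


print(",".join(resolve("chunk")))  # prints tok,morph,pos,chunk
```

With such a table in the documentation, the chain I found by trial and error could simply be looked up or computed.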

The topology description is somewhat helpful (https://github.com/dlt-rilmta/emtsv/blob/master/docs/emtsv_modules.pdf), but it uses the "package names" instead of the module names expected by emtsv. E.g. it contains emToken, while in emtsv it needs to be referenced as tok.

It would also be great to know what each column in the result actually means and how these columns should be interpreted. This is also really difficult to find out, even after reading a lot of publications related to emtsv and e-magyar.

So, a nice documentation structure for someone just getting started with emtsv would be something like this:

  1. Description of packages (emToken, emMorph, etc)
  2. What modules each package provides for emtsv (tok, morph, etc)
  3. Required input and output of each module + modules providing those inputs
  4. Description of the output formats (form, anas, xpostag, etc.)

Items 1 and 2 are already available; items 3 and 4 are what I am missing.
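For item 4, even a short reading example would help newcomers. The following is a minimal Python sketch of how the tab-separated output might be loaded so that each column can be inspected by name. It assumes a header row with the column names, one token per row, and a blank line between sentences; the `read_sentences` helper is hypothetical and the sample values are placeholders, not real emtsv output.

```python
# Made-up sample: header row, one token per row, blank line between sentences.
# The values are placeholders, NOT real emtsv output.
SAMPLE = (
    "form\tlemma\txpostag\n"
    "Ez\tLEMMA1\tTAG1\n"
    "jó\tLEMMA2\tTAG2\n"
    "\n"
    "Igen\tLEMMA3\tTAG3\n"
)


def read_sentences(text):
    """Parse header-first TSV into a list of sentences (lists of token dicts)."""
    lines = text.splitlines()
    header = lines[0].split("\t")
    sentences, current = [], []
    for line in lines[1:]:
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        # Map each column name from the header to this token's value.
        current.append(dict(zip(header, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences


for sentence in read_sentences(SAMPLE):
    print([(token["form"], token["xpostag"]) for token in sentence])
```

Documentation for each column could then state what the values under names such as form, anas and xpostag actually mean.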

sassbalint commented 3 years ago

@dlazesz Balázs, I guess the .fig format may be thrown out, as only a few :) people are eager to work with it. Do you have any idea about a convenient format that is suitable for shared work?

beepsoft commented 3 years ago

@sassbalint thanks for picking up this issue!

> @dlazesz Balázs, I guess the .fig format may be thrown out, as only a few :) people are eager to work with it. Do you have any idea about a convenient format that is suitable for shared work?

Do you mean as a replacement for emtsv_modules.pdf, or what would this .fig be used for? Unfortunately I have no idea about this.

dlazesz commented 3 years ago

For the record: the FIG file is meant to be edited and then converted to PDF. Bálint (@sassbalint) used to maintain the FIG file.

As both Bálint and I have left the project, I proposed that Noémi (@vadno) could do a one-time rewrite in Tikz, so that others can edit it more conveniently in the future as new modules emerge. I do not want to speak on her behalf.

I have no other ideas on how to make the figure easier for everybody to maintain, or on who would actually do it in the first place. All ideas, suggestions and offers to maintain it are welcome!

@beepsoft You could send PRs for the documentation (or any part of the project) if you have any ideas on how to improve it.

sassbalint commented 3 years ago

@dlazesz Balázs, could you draw (by hand!) a figure of the current state of the system? If yes, we could talk about it on Zoom and then I will create a new version (in .fig...).

@beepsoft .fig is to be edited with xfig, which is an old but very good piece of software, I think.

vadno commented 3 years ago

As @dlazesz mentioned, I'll draw a tikz version of the figure. @sassbalint, xfig is great, but for me tikzpicture is a bit easier to use. I'll try to do it ASAP... OK?

sassbalint commented 3 years ago

> As @dlazesz mentioned, I'll draw a tikz version of the figure. xfig is great, but for me tikzpicture is a bit easier to use. I'll try to do it ASAP... OK?

Thank you, @vadno Noémi. :)

While, as Balázs put it, "one-time rewrite in Tikz to enable it for others to edit it more conveniently in the future" sounds good, I guess that there is a chance that by creating the Tikz version you just take over this task for a long time, in practice. Are you OK with this? :)

vadno commented 3 years ago

@sassbalint No, I'm not OK with this :) I'll try to write it as clearly as possible, hoping that others can extend it later without my help. But of course I'll help if needed ;)

dlazesz commented 3 years ago

UPDATE: Thanks to @vadno, the new module figure design has been committed: https://github.com/nytud/emtsv/blob/master/docs/emtsv_modules.pdf

I hope it can better handle the growing number of modules. We plan to restructure and maybe split the figure, as more input-output modules are planned in the near future.

I am keeping this issue open, as the current update does not resolve the original post; it only eases the situation. More documentation is on its way.