YAML - pros and cons - Githubissues

sebastianruder / NLP-progress

Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the most common NLP tasks.

https://nlpprogress.com/

MIT License

22.71k stars 3.62k forks

YAML - pros and cons #117

Closed sebastianruder closed 6 years ago

sebastianruder commented 6 years ago

I'd like to discuss here the pros and cons of using YAML going forward or whether we should stick with Markdown tables. Here are some pros and cons, mainly from @NirantK (in https://github.com/sebastianruder/NLP-progress/pull/116), @stared (in https://github.com/sebastianruder/NLP-progress/issues/43, https://github.com/sebastianruder/NLP-progress/pull/64) and myself.

Pros:

Easier trend spotting in performance improvements
Easy to create plots and visualizations going forward
Data is separated from presentation

Cons:

Hard for contributors, e.g. HTML omissions can't be spotted without setting up Jekyll locally
GitHub repo becomes useless for readers, who have to rely exclusively on nlpprogress.com
Many visualizations (e.g. bar charts) based on performance numbers are not more useful than the raw tables

Other opinions are welcome.

NirantK commented 6 years ago

Thinking out loud:

Assuming that markdown tables can be parsed with something like fsm, we can probably use markdown tables + git logs for plotting and trend spotting.

We could also set up a bot that periodically (say, every 2 weeks) dumps the Markdown data into a more machine-readable _data folder for such usage.
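A minimal sketch of what such a Markdown-to-data dump might look like, assuming well-formed pipe tables with a header row followed by a separator row. The function name `parse_markdown_table` and the sample table are hypothetical, and JSON is used here only because it needs nothing beyond the standard library; this is not the parser suggested in the thread.

```python
import json


def parse_markdown_table(text):
    """Parse one pipe-delimited Markdown table into a list of row dicts.

    Simplifying assumptions (not from the thread): every table line starts
    with "|", the second table line is the |---|---| separator, and cells
    contain no escaped pipes.
    """
    rows = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip anything that is not a table row
        # Drop the outer pipes; a missing closing pipe is tolerated.
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        rows.append(cells)
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is the separator row
    return [dict(zip(header, row)) for row in body]


# Hypothetical table contents, for illustration only.
sample = """
| Model | Accuracy | Paper |
| --- | --- | --- |
| Model A | 91.2 | Paper A |
| Model B | 89.7 | Paper B
"""

if __name__ == "__main__":
    print(json.dumps(parse_markdown_table(sample), indent=2))
```

A periodic job could write the resulting JSON into the _data folder. Real tables in the repository may deviate from these assumptions, so the output would still need checking.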

stared commented 6 years ago

@NirantK It is nowhere near that simple. Turning the Markdown tables into YAML took a lot of manual labour on my part (even with some automation): the tables used various formats, contained formatting mistakes, etc.

Also, for converting tables to YAML I wrote this script: https://gist.github.com/stared/ec29b1e8d3c99a6288dcc20d77affc93

It requires some manual inspection, as:

there is some inconsistency with table formats
there is some misformatting (e.g. no closing |)
I manually check whether to use &Author2018 and <<: *Author2018 mappings

NirantK commented 6 years ago

Thanks for sharing that script @stared ! Some neat hacks there.

I am hoping that if we enforced a markdown table linter of some sort, this would be slightly less tedious to do. I definitely don't claim that it is simple.
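As a rough illustration of what such a linter could check, here is a minimal sketch that flags two of the problems mentioned above: rows without a closing | and rows whose column count differs from the header. The function name `lint_markdown_table` and the rule that a table row is any line starting with | are assumptions made for this example, not an existing tool.

```python
def lint_markdown_table(text):
    """Return a list of (line_number, message) problems found in table rows.

    Assumption for this sketch: a table row is any line that starts with "|",
    and the first such line after a non-table line is the header.
    """
    problems = []
    expected_cols = None
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line.startswith("|"):
            expected_cols = None  # a non-table line ends the current table
            continue
        if not line.endswith("|"):
            problems.append((number, "missing closing |"))
        cols = len(line.strip("|").split("|"))
        if expected_cols is None:
            expected_cols = cols  # the header row fixes the column count
        elif cols != expected_cols:
            problems.append(
                (number, f"expected {expected_cols} columns, found {cols}")
            )
    return problems


if __name__ == "__main__":
    # Hypothetical table with two deliberate mistakes.
    table = "| Model | Score |\n| --- | --- |\n| A | 1 |\n| B | 2\n| C | 3 | extra |\n"
    for line_number, message in lint_markdown_table(table):
        print(f"line {line_number}: {message}")
```

A check like this could run in CI on every pull request, so contributors see formatting problems without setting up Jekyll locally.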

To focus on the issue at hand, I am simply asking if the loss in reader (and contributor) ease of access is worth the gain from visualizations?

sebastianruder commented 6 years ago

Yep, a table linter or better enforcement of style guidelines is something we'd definitely want to do.

So far, I haven't really seen any visualizations that added much value beyond what the tables provide. The progress visualizations at AI metrics are nice, but I don't think they're that helpful if a task doesn't have a clear metric of human performance. @stared, do you have any thoughts regarding a "killer visualization" that would clearly warrant using YAML files?

NirantK commented 6 years ago

Hey @stared - just following up :)

stared commented 6 years ago

OK, I know it is a matter of taste. Personally, I find YAML files easier to edit than Markdown tables, and less error-prone (and certainly simpler than Markdown tables plus an enforcing linter). I admit that others may have different opinions, depending on their background.

With killer features:

visualization (all markdown scraping will be clunky)
possibility to add OTHER data (e.g. comments, other fields when they become necessary)
possibility of copying entries (before there was redundancy and there were errors)

For contributions, I think that the tricky part is telling contributors where the entry is (this can be done easily by adding an automatic link, [edit entry in filename]).

For viewing changes - by pushing to one's own repos, one can see it online.

When it comes to visualizations: true, for many areas (especially if there are only 4 entries or so) they do not provide that much additional information.

sebastianruder commented 6 years ago

While I really like the idea of separating the presentation from the data and storing the data in a dedicated format, the benefits at this point to me seem to be overshadowed by the additional burden placed on the contributor (who might not have used YAML before) and on the reader (who won't be able to view the tables on GitHub).

As at this point the objective should be to get more data (for more tasks and languages) in this repo, these two disadvantages to me outweigh the potential upsides of using YAML.

NirantK commented 6 years ago

@sebastianruder should I go ahead and refactor the Hindi and Korean pages to use Markdown?

sebastianruder commented 6 years ago

Yes, let's do that. Thanks!

sebastianruder commented 6 years ago

Ok. So as things stand now, I think it'll be more beneficial to the community to have things in the more readable Markdown format to facilitate reading and contributing. We can think again about converting to YAML if there's a more immediate need in the future.