lbechberger / ML4NLP

Material for the Practical Seminar "Machine Learning for Natural Language Processing"
MIT License

Feedback for Group Delta #7

Open lbechberger opened 5 years ago

lbechberger commented 5 years ago

This is the thread where all the other groups leave their feedback for the documentation of group Delta.

ljscfo commented 5 years ago

First of all: a great and especially detailed text. I like that you look at the task from different perspectives and present a conclusive line of argument to justify the choices you made. But nevertheless, my task is to point out things that you could improve (which is hard enough):

You seem to have the problem that the what (question answering), the how (machine learning and a classifier) and the based-on-what (news articles which have already been processed into tuples) don't really fit together. While the documentation explains very well that you want to use questions that can be answered by the subject-predicate-object triples, you get a bit lost when defining the actual problem you want to solve. Is it the extraction of tuples that aren't in the KnowledgeStore database? (The two paragraphs under the question types suggest something like that.) Then the question answering would be some kind of separate task based only on tuples. Or is it question answering using mainly raw articles? The paragraphs "Selecting the correct article..." and "Answering the question..." point towards that approach. Though I must admit I get a bit lost in the latter paragraphs while trying to figure out what you mean by an incomplete tuple.

Two other minor things: In the first paragraph "Overall task", you talk about training data. Sure thing, we know what you're talking about, but a third-party person could only guess from the seminar title that you're talking about how to train a classifier in the context of machine learning. In "How it's broken down.." you say something about 12pps. No idea what pps means.

cstenkamp commented 5 years ago

<Idk if we're supposed to answer these>

> you get a bit lost when defining the actual problem you want to solve. Is it the extraction of tuples that aren't in the KnowledgeStore database? (The two paragraphs under the question types suggest something like that.) Then the question answering would be some kind of separate task based only on tuples.

With the two-step architecture (ignoring parsing & language generation), we want to be able to extract knowledge from new articles. The tuples which are already extracted are only an intermediate step, from which we can bootstrap our dataset. We don't want to generate these tuples for new texts, but skip this step.
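Roughly, the idea looks like this (just an illustrative sketch; every function and method name here is made up, this is not our actual code):

```python
# Illustrative pseudocode of the two-step idea; all names are hypothetical.

def bootstrap_training_data(articles, known_triples):
    """Step 1: pair article sentences with the KnowledgeStore triples
    they express, to get labeled training data for free."""
    pairs = []
    for article in articles:
        for sentence in article.split("."):
            for subj, pred, obj in known_triples:
                if subj in sentence and obj in sentence:
                    pairs.append((sentence, (subj, pred, obj)))
    return pairs

def answer(model, question, new_article):
    """Step 2: a model trained on those pairs answers directly from raw
    text; no triples are generated for the new article."""
    return model.predict(question, new_article)  # hypothetical interface
```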

> Though I must admit I get a bit lost in the latter paragraphs while trying to figure out what you mean by an incomplete tuple.

A triple/complete tuple is a set of subject-predicate-object. An incomplete tuple is one where any of those is missing - that's what's supposed to be generated from the question ("Who shot Lincoln?" becomes the incomplete tuple [<?>, <shot>, <Lincoln>]).
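To make the representation concrete, a minimal sketch (illustrative class and variable names only, not our actual code):

```python
# Illustrative sketch of complete vs. incomplete tuples; not our actual code.
from typing import NamedTuple, Optional

class Triple(NamedTuple):
    subject: Optional[str]    # None marks the slot the answer has to fill
    predicate: Optional[str]
    object: Optional[str]

# "Who shot Lincoln?" -> subject unknown, predicate and object given
question_tuple = Triple(subject=None, predicate="shot", object="Lincoln")

def matches(candidate: Triple, incomplete: Triple) -> bool:
    """A complete triple answers the question if it agrees on all known slots."""
    return all(known is None or known == value
               for known, value in zip(incomplete, candidate))

assert matches(Triple("Booth", "shot", "Lincoln"), question_tuple)
```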

> In the first paragraph "Overall task", you talk about training data. Sure thing, we know what you're talking about, but a third-party person could only guess from the seminar title that you're talking about how to train a classifier in the context of machine learning.

Well, I assumed the audience knows what that is :D

> In "How it's broken down.." you say something about 12pps. No idea what pps means.

It means percentage points. Like absolute percent. Wasn't sure if that was a thing, maybe it's not :D

pphilihpp commented 5 years ago

Hi Group Delta,

this is our feedback for your documentation of the last week. Overall you did a great job and explained very well which dataset you will use for your classification task. But let us go into a bit more detail.

The content of your documentation is very good. It seems to be technically accurate, although concepts like a B-LSTM, GloVe or GoogleNews Word2vec are hard to verify within two days. Moreover, your documentation is consistent, because you stick to the same terms, like information-triples and the abbreviation of Named Entity Recognition (NER), throughout your whole text. At first glance there are no key points missing in your documentation, and therefore we consider it complete. One point for improvement would be to explain your design decisions in a little more detail. For example, it is a little unclear why you prefer the Named Entity Recognition (NER) database over other approaches.

In the following we give some comments on your style. Most of your text is readable and therefore easily accessible to the reader. However, there are some grammatical and spelling mistakes in your text that should definitely be corrected in the future (e.g. a missing comma after "information-triples" in the first part, "step-by-step" begins with a lowercase letter whereas "Training" begins with a capital letter in the "Limitation" part, and a couple of times you have written "sentense" instead of "sentence"). Your text is well structured, thanks to the division into different sections with their own topics. Examples and visualizations are somewhat lacking in your documentation, but due to the mostly clear explanations that is not such a big issue.

So finally: you provided good documentation this week. Next time you should pay a little more attention to your style (especially the grammatical and spelling mistakes) and then you are good to go!

bmajumderuos commented 5 years ago

Review Week 4:

Overall your documentation is detailed, consistent and complete. Following are the comments for the individual areas:

Regarding actual approach/design decisions: The approach and design decisions have been clearly detailed. One question that remains is: out of the four entities, how do you select two entities in sentences where there are more than two entities present?

Regarding completeness of the actual documentation: The documentation seems complete with all the relevant details and examples being covered.

Regarding Style & Readability: Overall the classification of topics and readability is great. One way to improve it even more would be to separate the last two paragraphs into suitable sub-headings.

pphilihpp commented 5 years ago

Hello Group Delta,

here is the review for your documentation of week 4. You described very well what you did this week and what your current problems are. Overall you did a good job, but we identified some weaknesses.

Regarding the content: Your documentation is technically accurate. NLTK seems to have some problems extracting the desired information triples, and it is understandable that you are looking for other ways to extract them. Moreover, your documentation is consistent, because you stick to the same words and do not shuffle around different terms for the same content. Well done! Unfortunately, we would not say that your documentation is complete, because some important key points for your design decisions are missing. For example, you said that you "run some components of the NLTK pipeline" over the articles. Which components exactly did you use? What components are in the NLTK pipeline? Where is the code for this? These are all questions that the reader would ask immediately. Moreover, a visualization and a reference to the source code would be really helpful in order to understand what you did. Another example is "using a custom gazetteer". Most readers will not know what this is or how you used it, so a deeper explanation would be really helpful.
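Just to illustrate what a reader is left guessing at: a typical NLTK pipeline of this kind might look roughly like the sketch below (standard NLTK calls; whether these are the components you actually ran is exactly what the documentation leaves open):

```python
# Sketch of a standard NLTK pipeline, purely for illustration; the group's
# actual component selection is not documented.
import nltk

# one-time model downloads required by the calls below
for resource in ("punkt", "averaged_perceptron_tagger",
                 "maxent_ne_chunker", "words"):
    nltk.download(resource)

sentence = "Barack Obama visited Berlin."
tokens = nltk.word_tokenize(sentence)   # tokenization
tagged = nltk.pos_tag(tokens)           # part-of-speech tagging
tree = nltk.ne_chunk(tagged)            # named entity recognition
print(tree)

# A "custom gazetteer", in contrast, is essentially a hand-curated lookup
# table of known entities that is matched against the text:
gazetteer = {"Barack Obama": "PERSON", "Berlin": "LOCATION"}
found = {name: label for name, label in gazetteer.items() if name in sentence}
print(found)
```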

Regarding the code: We really like your modular codebase and how you divided your overall program into multiple small functions. Moreover, the naming of your variables aids the understanding of your source code. Overall your code quality is really good, but next time you could maybe link some of your documentation to your codebase.

Regarding the style: Most of your documentation is easily accessible and grammatically accurate. The mention of the terms above ("NLTK pipeline" or "gazetteer") without an explanation makes it more difficult for the reader to understand your text, but despite these "heavy" terms, everything else is readable. Moreover, your text is well structured, and the example was really good and improved the understanding of your documentation very much!

Finally, you did a good job of documenting the progress of your project this week. In the future, you could provide some references from your documentation to your codebase, include some visualizations, and explain terms that are not so easy for potential readers to understand a little bit better.

Good luck for implementing the next steps!

JuAutz commented 5 years ago

As usual, the structure of your documentation was clear and well done. No backtracking to previous weeks was needed to understand it, and you wrote a very concise overview of your current state. The diagram is a nice touch to visualize your process.

However, something I felt the documentation lacked was a discussion of what the numbers mean for you. You have 139,343 triples, but do you think that is enough, too little, or maybe even too much to work with? Are you okay with roughly 8 hours of processing time? To us it seems fine, but that's because our approach takes more than two days and never needs to be repeated; if yours has to be run more than once, 8 hours might be bad.

Also, some background information on the IKW grid, even just a link to a website, would be useful to follow what you are doing. If we had never heard of it before, we wouldn't understand why jobs are killed or why memory might be an issue.

We also wanted to have a look at your code. It would have been nice if you had mentioned which files are relevant for your tasks, as you have quite a number of them, and for some, e.g. "text.py", "project.py", "notes.py" or your sge files, we are not sure whether they are part of your current process or just old leftovers. Also, as you are printing large amounts of data in e.g. get_triples, you might consider making use of a logger, so that you can assign that information a low-priority logging level and not spam your console when you don't need it.

Small nitpick: you misspelled database as "databse" once.
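For instance, a minimal logging setup along these lines would already help (just a sketch; the body of get_triples is a stand-in for your actual extraction code):

```python
# Sketch: replace print() spam with leveled logging.
import logging

logger = logging.getLogger("triple_extraction")
logging.basicConfig(level=logging.INFO)  # switch to logging.DEBUG to see bulk dumps

def get_triples(article):
    triples = [("Booth", "shot", "Lincoln")]  # stand-in for the real extraction
    logger.debug("raw triples: %s", triples)  # large output at low priority
    logger.info("extracted %d triples from article", len(triples))
    return triples

get_triples("some article text")
```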

bmajumderuos commented 5 years ago

Hi Group Delta,

Feedback Week 7: Overall your documentation is detailed, consistent and complete. Following are the comments for the individual areas:

Regarding actual approach/design decisions: The approach and design decisions have been clearly explained and detailed.

Regarding completeness of the actual documentation: The documentation seems complete, with the relevant details covered. A bit more detail on the Matthews correlation coefficient would have been better, since that is something that wasn't part of the discussions. The code is well commented (you seem to have a comment in German in entities_meta_infomation.py; it would be great to have everything in English) and easy to understand.
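For readers who, like us, hadn't come across it in the discussions: the Matthews correlation coefficient is computed from the confusion matrix and is available out of the box, e.g. in scikit-learn (the label vectors below are invented purely for illustration):

```python
# Sketch: Matthews correlation coefficient for a binary classifier.
# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)),
# ranging from -1 (total disagreement) over 0 (chance level) to +1 (perfect).
from sklearn.metrics import matthews_corrcoef

y_true = [1, 1, 0, 0, 1, 0]  # invented gold labels, for illustration only
y_pred = [1, 0, 0, 0, 1, 1]  # invented predictions
print(matthews_corrcoef(y_true, y_pred))  # -> 0.333..., i.e. weak agreement
```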

Regarding Style & Readability: Overall the classification of topics and readability is good. You have split the content under suitable sub-headings and it’s easy to read and understand.

Good luck with your dataset!