lbechberger opened 6 years ago
First of all, the number of citations and references was impressively high. The introduction in particular felt well researched and very informative. The examples used were current, common, and varied enough that the benefit of news recommendation was obvious. Your clear statement of goals is something to appreciate. However, the general approaches part marked a drop in quality. The first two sentences contained spelling mistakes ("sill", "generel"), and the grammar occasionally seemed off ("The underlying assumption is that if a person A has the same interest as a person B, the person A is more likely to have person B's interests than that of a randomly chosen person.", "An advantage of this strategy is that these data do not have to be entered into the system or maintained"). Also, the introduction and the project goal parts were written in a notably clearer style. Grammar aside, sentences like "The underlying assumption is that if a person A has the same interest as a person B, the person A is more likely to have person B's interests than that of a randomly chosen person." or "Given the established rules of statictics and data science, a tremendous amount of fake data would be needed for finding such intrinsic relations" make the text more difficult to read. "In order of implementing such a news recommendation system..." is the first occasion where the flaws in grammar actually impede the understanding of the text.
Generally speaking, the text is technically accurate and very extensive. The structure is clear and guides the reader nicely through the process. The citations and examples add value and demonstrate the effort that was made. However, the later parts seem hastily written and, especially due to the shaky grammar, are hard to read and understand. Having a second person proofread the text would improve on these points.
Review Week 3:
Overall your documentation is detailed, consistent and complete. Following are the comments for the individual areas:
Regarding actual approach/design decisions: Different approaches and design decisions have been clearly mentioned and their reasoning provided.
Regarding completeness of the actual documentation: The documentation seems complete with all the relevant details being covered. It also includes examples and citations for the sources referenced.
Regarding Style & Readability - It would be a good idea to separate the document into more sub-topics like Approaches, Choices, Summary and so on. It helps to know which part is referring to which topic/question.
Keep up the good work!
Review Week 4:
Your documentation for this week is technically accurate and consistent. It is clear from start to end that you spent this week evaluating two approaches for extracting additional data, and your structure makes that visually appealing. Besides the fact that it is readable and grammatically accurate, I especially like the visualization you used.
What I am missing in your documentation, however, are actual links to your source code as well as concrete examples. For example, you say you are "utilizing more than 30 different Regular Expressions". Without looking at your actual source code, it is impossible to say what these are doing exactly and why you had to implement them. Further, I am wondering what the size of this subset is, and whether it could, for example, be increased significantly by adding another 30 regexes. Also, for your claim about the pollution and the drop in quality, I would have liked more information and examples of what this drop looks like in practice. Furthermore, I am wondering similar things about your second approach: examples from this set of top-level categories would be nice, as well as information on how the structure is created recursively. Also, for the last paragraph, examples underlining your claim that almost 8000 new samples are not useful if they provide wrong categories would be helpful.
In short, while I really like the general structure and style of the text, agree with your design decisions, and think your documentation is perfectly readable and easily accessible, I would like more examples for your claims, as well as links to the source code. Another thing that would be nice to know is, for example, what the precise results of your first experiments were, and why those results rendered the approach not viable.
Dear Group Gamma,
Overall, you used many examples and nice visualizations that helped to understand your approach. Also, you used all terms consistently throughout your documentation. Still, a few grammatical mistakes and typos impeded the understandability slightly. Your strategy to exclude small categories could be explained in a bit more detail, or at least in a clearer way, we think. Especially the sentence part "Given the sufficient amount of categories with a rather high amount of categories" confused us, but we suppose that you meant to say: "Given the sufficient amount of categories with a rather high amount of articles".
In the second part, we were unsure whether you plan to generate negative examples of articles in addition to the positive ones you have mentioned, or whether you only want to use positive examples, as you indicate. In your visualization, it looks like you draw positive and negative examples from a set of merely positive examples of the internal user representation. A more detailed explanation of that part would help.
Concerning your last paragraph, we are not sure whether we understood your approach correctly: you are using a classifier to generate a training set for each of your users and then train the same classifier on that training set, right? Please elaborate a bit more on the details of your plan.
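For what it is worth, here is a minimal sketch of the circularity we think we are seeing, using scikit-learn stand-ins; the library choice and all names are our assumptions, not your actual code:

```python
# Hypothetical illustration of the loop as we read it; your classes differ.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
seed_X = rng.normal(size=(10, 5))        # small labelled seed set
seed_y = np.array([0, 1] * 5)            # its known labels
unlabelled = rng.normal(size=(100, 5))   # articles without labels

clf = LogisticRegression().fit(seed_X, seed_y)

# Step 1: the classifier generates the per-user training set ...
pseudo_labels = clf.predict(unlabelled)

# Step 2: ... and the same classifier is then re-trained on its own
# predictions (here together with the seed set, so both classes appear).
# This is the step we would like you to explain: as far as we can tell,
# it cannot add information the model does not already have.
X_train = np.vstack([seed_X, unlabelled])
y_train = np.concatenate([seed_y, pseudo_labels])
clf.fit(X_train, y_train)
```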
We did not test your code, but we believe you that it works. You have chosen meaningful names for the functions and variables you use, but a few more comments would still help to understand it even better. You did not include a "How to…" yet, which would be nice.
First of all, it's nice that you worked with last week's review and included the part about your setup. Also, the example dataset contributes to the understanding and readability of the documentation part. We were a bit confused about the user representation, especially about how the parameter number of representations works. Does it have an impact on parameters like number of articles per category, as indicated in the class description, or does it stand for a completely different way to represent a user? Also, how does the subclass Parameter work? You could go as far as to create something like a UML diagram to explain the structure of the classes. Nevertheless, it's really great that you use object-oriented programming to model the dataset.

Readability: Consider setting things like file or parameter names in italics. For example, in the sentence "The Dataset class gets the parameters number of user and all the parameters of the User class.", it would be much easier to read if you wrote the parameter and class names in italics.

Grammatical accuracy: Have a look at the rules for commas in the English language. Commas are not used to introduce restrictive relative clauses (drop the commas in e.g. "...the IPython Jupyter Notebook file, you want to open...", "...missing are functions, that completely...").

Some typos: "What was still missing are functions..." -> "What was still missing were functions..."; "per user presentation" -> "per user representation"; "to feet the dataset into" -> "to feed the dataset into".
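To make our question more concrete, here is a hypothetical sketch of the structure as we currently imagine it; apart from Dataset, User, Parameter, number of users, and number of representations, every name below is our guess and not taken from your code:

```python
# Hypothetical reconstruction of the class structure described in the
# documentation; attribute names beyond those mentioned there are guesses.
class Parameter:
    """A single configurable value, e.g. the number of articles per category.
    (How this subclass is actually used is exactly what we are asking.)"""
    def __init__(self, name, value):
        self.name = name
        self.value = value

class User:
    def __init__(self, number_of_representations, articles_per_category):
        # Our open question: does number_of_representations constrain
        # articles_per_category, or is it an independent representation?
        self.number_of_representations = number_of_representations
        self.articles_per_category = articles_per_category

class Dataset:
    # "The Dataset class gets the parameters number of user and all the
    # parameters of the User class."
    def __init__(self, number_of_users, **user_parameters):
        self.users = [User(**user_parameters) for _ in range(number_of_users)]
```

A UML diagram conveying roughly this information would already resolve most of our confusion.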
Dear Group Gamma,
This week you wrote about the metrics you want to use in your model evaluation and explained why you want to use Accuracy, False Alarm Rate and Precision. As previously mentioned, we like your way of structuring your text and visualising when necessary. You wrote in an easily accessible language and used the terms correctly and coherently, even though you could be more consistent in the spelling, e.g. decide whether to write compound nouns with a hyphen or not (you used both in the case of "hyperparameter"), or whether you want to capitalise proper nouns ("accuracy, False Alarm Rate, precision"). When you explained your splitting ratio you mentioned an "80%-20%-20%" split, but since you're talking about percentages, you probably meant a different ratio like 80-10-10 or 60-20-20? In your "Dataset split" section, you wrote about the usage of a "specialized validation set" to "[...] ensure that not even the choice of hyper-parameter might leak training information into the final evaluation". It would improve the understanding of your sentence if you went into more detail and explained the meaning a bit further. Last but not least, you referred in the same section to an "alternative in related literature" that you did not cite or mention any further.
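To illustrate the ratio we suspect you meant, here is a minimal sketch of an 80-10-10 split in which only the validation set is used for hyperparameter choices; the use of scikit-learn and all names are our assumptions:

```python
# Minimal sketch of an 80%-10%-10% train/validation/test split.
# Assumes scikit-learn; your actual pipeline may look different.
from sklearn.model_selection import train_test_split

def split_dataset(X, y, seed=42):
    # First split off 20% as a held-out pool ...
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    # ... then halve that pool into validation (10%) and test (10%).
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=seed)
    return X_train, X_val, X_test, y_train, y_val, y_test

# Hyperparameters are tuned on (X_val, y_val) only; (X_test, y_test) is
# touched exactly once for the final evaluation, so not even the choice
# of hyperparameters can leak training information into it.
```

This is also how we read your "specialized validation set" remark, which is why a more explicit explanation in the documentation would help.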
All in all, we liked your documentation this week.
First of all, the two latest parts of your documentation are written in a way that makes them easy to read without lacking any professionalism. Compliments on that. Also, it's nice that you introduce feature extraction in a general way. For example, the part on using neural networks for both feature extraction and prediction: that's nice, I didn't know that. Nevertheless, here are some suggestions on how you could still improve:
Feedback for Week 13: Your documentation is very well written and seems to be complete in describing the approach you are taking towards feature and model selection. "leaf it out" in point 2 is probably a typo, but apart from that I did not find any grammatical or orthographic mistakes. I like that you include citations at the end of your documentation; that really makes it look very professional. I only had a quick glance at your code, but it seems very consistent in the naming and formatting choices you make, which is one of the most important parts of writing clean code in my opinion. All in all, a very impressive documentation so far!
This is the thread where all the other groups leave their feedback for the documentation of group Gamma.