

5310 assignment 2B #25


henryliangt commented 2 years ago

THE PROJECT WORK FOR THIS STAGE: SUMMARY

- [Done together] Identify an attribute that you will all make predictions about, and find a dataset that contains this attribute. The attribute you are predicting may be quantitative or nominal. The dataset may be one from the previous stages of this project.
- [Done together] Decide on the measure of success for the predictive models you will be producing. You will need to justify your choice of measure and describe its strengths and limitations.
- [Done together] Divide the dataset into a training set and a test set. We suggest having at least one-tenth of the original dataset in the test dataset.
- [Done together] Coordinate in choosing the methods you will use, so that each member produces a predictive model for this attribute using the training dataset (the coordination is needed to avoid duplication between members, and to enable a good conclusion for your report).
- [Done separately by each member] Use Python (for example, the scikit-learn library) to produce a predictive model for the chosen attribute, from the training dataset, using the kind of model and training method that was allocated to you by the group. If your training method has hyper-parameters, you should tune them as well as possible, but only using parts of the training dataset to do so. [You must not use any of the test dataset for this.]
- [Done separately by each member] Evaluate the quality of the predictive model you produced, in terms of the measure of success that the group chose.
- [Done separately by each member] Write your section in Part A of the report, in which you present the work you have done individually.
- [Done together] Write Part B of the report, which discusses the different models and their strengths and weaknesses. This should be written for a reader who is interested in your research or business question.
- [Done together] Produce a PDF of the whole report, with all individual sections and the jointly written Part B, and produce the compressed folder with all the data and code from each member. Submit it all.

IDENTIFY DATASET, ATTRIBUTE TO PREDICT, MEASURE OF PREDICTIVE SUCCESS

The models created in this Stage must all predict (in different ways) one common attribute in one common dataset. You are allowed to use a dataset you already have from Stage 1 or 2A, but you are equally free to change dataset and even domain. There are no requirements for a particular origin or volume of dataset for this Stage, but note that many machine learning techniques do not work well unless the dataset is quite clean. We recommend that you do some preliminary data analysis to convince yourself that there is some relationship between the other attributes and the one you are going to predict (otherwise predictions will not be very effective). You also need to choose how you will measure the effectiveness of the predictions; we recommend that you use one of the measures that scikit-learn can calculate, given the test data and the predictions made for those items. For levels higher than Pass, you need more than one measure that you will calculate on each model.

CHOOSE THE TYPE(S) OF PREDICTIVE MODEL AND THE TRAINING METHODS

Each member needs to produce one predictive model that will predict the chosen attribute from the values of some or all of the other attributes. Details are in the marking scheme below. It is required that all the members produce their predictive models in different ways, so you need to coordinate among the members: if two members want to use the same approach, at least one will need to change it a bit (for example, you could each use the same general training technique but scale the data attributes differently, or use a different subset of the input attributes).

INDIVIDUAL WORK

Each member then needs to work with the training set and the test set to produce the material for their section in Part A of the report. This involves writing Python code (we recommend scikit-learn) to produce a predictive model based on the training set, and then running the model on the test set and calculating the agreed metric for how good the predictions are. A minimal code sketch of this workflow appears at the end of this overview. Part A needs to include the code each member writes; higher marks require additional discussion and explanation (as indicated in the marking scheme).

WRITE A REPORT

Working together as a group, you need to produce a report. The structure of the report is described below in detail, as the report is the main basis for grading in this project. The report has sections for each member's separate work, as well as a brief combined introduction that explains the topic or issue, and a combined presentation of conclusions.

PRODUCE PDF AND ZIPPED FOLDER, AND SUBMIT

From the combined document, you need to produce a PDF. As well, there needs to be a file that compresses a folder, within which are subfolders for each member; each subfolder contains the dataset the member worked with, and the code or spreadsheet for producing their analysis (both summaries and charts). One person submits both the PDF and the zipped folder, to the submission links on Canvas, on behalf of the whole group. Every member of the group will get the marks earned by the combined submission.
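To make the individual workflow concrete, here is a minimal sketch, assuming a hypothetical dataset file `houses.csv` with a nominal target attribute `price_band`, a random-forest classifier as the method allocated by the group, and macro F1 as the agreed measure of success; none of these choices are prescribed by the brief. The point is the shape of the workflow: hold out a test set, tune hyper-parameters only within the training data, then evaluate once on the test set.

```python
# Minimal sketch of the individual workflow (hypothetical dataset, target and model).
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# "houses.csv" and "price_band" are placeholders for your group's dataset and attribute;
# this sketch assumes the input attributes are already numeric (otherwise encode them first).
df = pd.read_csv("houses.csv")
X = df.drop(columns=["price_band"])
y = df["price_band"]

# Hold out at least one-tenth of the data as the test set, as the brief suggests.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Hyper-parameter tuning uses cross-validation on the training data only;
# the test set is never touched until the final evaluation.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    scoring="f1_macro",
    cv=5,
)
search.fit(X_train, y_train)

# Single evaluation on the held-out test set using the agreed measure.
y_pred = search.best_estimator_.predict(X_test)
print("Test macro F1:", f1_score(y_test, y_pred, average="macro"))
```

Each member would swap in their own allocated model type and hyper-parameter grid, keeping the same split and the agreed metric across the group so the results in Part B are comparable.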

GROUP PROCESS

During the project, you need to manage the work among the group members. We insist that every person do each activity, and describe what they did and found in the appropriate section of the report and in the appropriate subfolder of the compressed folder that gets submitted. We intend for the members to compare their work regularly and learn from one another (as well as from tutor feedback during lab sessions). Because any member's poor work will reduce everyone's score for the group component of the marking (unless it is properly documented in a "Note to marker"), make sure to report any difficulty in working together to the unit coordinator promptly, as described above, or (if it is too late for that) use the "Note to marker" to explain what has happened.

WHAT TO SUBMIT, AND HOW

There are two deliverables in this Stage of the Project. Both should be submitted by one person, on behalf of the whole group. The marks from this stage will appear in the Canvas gradebook as being associated with the report submission; the other submission has no marks appearing for it in Canvas, but it can be used as evidence in determining the mark for the stage.

SUBMIT A STAGE 2B WRITTEN REPORT ON YOUR WORK, AS A PDF

This should be submitted via the link in the Canvas site. The report should have two Parts. Part A should be targeted at a tutor or lecturer whose goal is to see what you achieved, so they can allocate a mark. Part B is targeted at someone who is interested in your research or business question, and so wants to understand how well various machine learning approaches work for producing predictive models in the context of that question. The report should have a front page that gives the group name and lists the members involved (giving their SID and unikey, not their name). The body of the report then has the following structure (this corresponds to the marking scheme):

  1. In Part A, there is an initial section. This section is not marked as such; it is just so the marker can understand the setting for the rest of the report. In this section you must:
     - State your research or business question.
     - State the domain and the dataset you are using.
     - Indicate how you split this into training and test data.
  2. Next in Part A, there should be one section for each member (the section should state the SID/unikey of the group member who did the work reported in this section). Each of these sections should have the following subsections:
     - A description of the way you produced the predictive model, including the Python code you wrote that produces the model, and any pre-processing, e.g. rescaling some attributes. If possible, you should also give the predictive model itself (e.g. for a linear regression, report the coefficient of each attribute in the model; for a decision tree, state the different decision points).
     - The evaluation of how well your predictive model predicts; this must include the code you wrote that calculates some measure of effectiveness on the test data, as well as the actual value of this measure for your predictive model. For higher marks, textual discussion is also needed (see the mark scheme below). For example, you may consider using significance testing, confidence intervals, regression R-squared, clustering V-measure, or classification F1-score. A code sketch of how these two subsections might fit together appears after this list.
  3. There is a single Part B, jointly written by the group. It is written for readers who are interested in your business question. In it, you describe the different ways the members produced predictive models, and comment on the evaluations, to draw conclusions about the strengths and limitations of the different approaches, tying this back to your business question (see the marking scheme for more guidance on what is expected here). There is no required minimum or maximum length for the report; write whatever is needed to show the reader that you have earned the marks, and don't say more than that! Pass-level performance should be feasible in less than one page per member, plus a conclusion that is less than a page. However, reports that are unnecessarily long-winded will be penalized (see the penalties section below).
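As a hedged illustration of the two member subsections above (describing the model, including any pre-processing, and then evaluating it with more than one measure), here is a minimal sketch assuming a linear regression on a hypothetical quantitative target `price` in a placeholder file `houses.csv`, with all input attributes numeric. Your group's dataset, target attribute and model type will differ.

```python
# Hypothetical example: report the fitted model itself (coefficients of a linear
# regression) and evaluate it on the test data with two measures.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error

# Placeholder dataset and quantitative target; substitute your group's own.
df = pd.read_csv("houses.csv")
X = df.drop(columns=["price"])
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Pre-processing: rescale the input attributes, fitting the scaler on the training data only.
scaler = StandardScaler().fit(X_train)
model = LinearRegression().fit(scaler.transform(X_train), y_train)

# The predictive model itself: one coefficient per (scaled) input attribute, plus the intercept.
for name, coef in zip(X_train.columns, model.coef_):
    print(f"{name}: {coef:.3f}")
print(f"intercept: {model.intercept_:.3f}")

# Evaluation on the test data, reported with more than one measure.
y_pred = model.predict(scaler.transform(X_test))
print("R-squared:", r2_score(y_test, y_pred))
print("Mean absolute error:", mean_absolute_error(y_test, y_pred))
```

Printing the coefficients (or, for a tree, the decision points) is one way of giving "the predictive model itself"; reporting two measures on the test data is in line with the more-than-one-measure expectation for higher marks.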

SUBMIT A COPY OF THE STAGE 3 DATA AND CODE

This should be submitted through the Canvas system, as a single zip or tar.gz file. You should have a single top folder, with subfolders for each member. The subfolder for a member should contain the Python code to calculate a predictive model and to calculate some measure of effectiveness of the model (and, if you have done any further transforms on attributes before training/testing, the code for these should also be part of what is in your folder). You then compress the top folder (with all these subfolders and their contents), and submit the single compressed file.

MARKING

Here is the mark scheme for this assignment. The score (out of five) is the sum of separate scores for each of three components. Note that there is an individual and a group component to each member's mark.

PRODUCING PREDICTIVE MODELS [2 POINTS] [INDIVIDUAL MARK]

This component is assessed based on the corresponding subsection of the separate member sections in Part A of the report; the uploaded data and code may be checked by the marker as supporting evidence for claims made in the report.

- Full marks: the Distinction criteria hold, and also there is a clear explanation of any method that is not presented in the tutorials, and an argument for why this is a reasonable approach to consider for the task (this discussion should go well beyond simply reporting that the model predicts well, to argue that one could reasonably hope that it might be good, in several ways).
- Distinction: the Pass criteria hold, and also at least one of the methods used must go beyond what is covered in the tutorials.
- Pass: the group member (except when the situation is reasonably explained in a "Note to Marker") uses Python and the agreed training dataset, and with these correctly produces a predictive model for the agreed attribute. The code that each member wrote to produce their model (including any preliminary attribute transformations) must be explicitly shown in the report. The ways in which the various members' models are produced should all be different from one another (this could be different algorithmic training techniques, different choices of hyper-parameters, different scaling or choice of input attributes, etc.).
- Flawed: Some predictive model is produced using Python.

EVALUATIONS OF PREDICTIVE MODELS [2 POINTS] [INDIVIDUAL MARK]

This component is assessed based on the corresponding subsection of the separate member sections in Part A of the report; the uploaded data and code may be checked by the marker as supporting evidence for claims made in the report.

- Full marks: the Distinction criteria hold, and also, for each approach, there is a reasonable discussion relating the outcome of the measurements to the nature of the training approach, characteristics of the dataset and any transformations done.
- Distinction: the group member (except when the situation is reasonably explained in a "Note to Marker") has correctly reported on more than one measure of performance of the model on the test dataset; the code that does this measurement must be explicitly shown in the report. Also, for each approach there is a sensible discussion of the interpretation of the measurements (for example, whether it indicates overfitting or underfitting, or whether the accuracy/precision/recall/F1 score differs between different classes in your data; a code sketch of such diagnostics appears after this marking scheme).
- Pass: the group member has correctly reported on some measure of performance of the model on the test dataset; the code that does this measurement must be explicitly shown in the report. The ways in which the various members' models are produced should all be different from one another (this could be different algorithmic training techniques, different choices of hyper-parameters, different scaling or choice of input attributes, etc.).
- Flawed: Some reasonable attempts to evaluate the effectiveness of some of the predictive models.

DISCUSSION [5 POINTS] [GROUP MARK]

This component is assessed based on Part B (group component) of the report. Material in Part A, or the submitted data and code, may be checked by the marker as supporting evidence for claims made in the report.

- Full marks: the Discussion section has all the Distinction criteria, and it suggests at least one reasonable improvement that can be made to each member's predictive model. The structure needs to be logical and well organised.
- Distinction: the Discussion section provides some accurate and clear information about the different machine learning methods that were used for this task, and provides useful insight into strengths and weaknesses of the different machine learning methods for answering the business or research question. It also indicates features of the dataset that impact on the outcomes. It also discusses, honestly and with insight, the strengths, limitations and uncertainties about the comparisons made between different machine learning techniques (for example, what are the strengths and limitations of the measurements which were used).
- Pass: the Discussion section provides some accurate and clear information about the machine learning techniques that were used for this task, and how the resulting predictive models performed.
- Flawed: the Discussion section describes the machine learning techniques that were used.

CONCLUSION [1 POINT] [GROUP MARK]

This component is assessed based on Part B (group component) of the report. Material in Part A, or the submitted data and code, may be checked by the marker as supporting evidence for claims made in the report.

- Full marks: the Conclusion section has all the Distinction criteria, and also makes reasonable suggestions for future work on your analysis and predictive models that can help de-risk the recommended course of action.
- Distinction: in addition to the requirements for a Pass, the Conclusion describes the extent of support for this course of action, based on the information in the Discussion section, identifying what risks, limitations and caveats apply.
- Pass: the Conclusion section describes a recommended course of action in relation to your research or business question, that is supported by the information in the Discussion section.
- Flawed: the Conclusion section describes a recommended course of action in relation to your research or business question.

PENALTIES

10% of the overall marks available will be deducted if your report is unnecessarily long-winded and does not concisely address the marking criteria. In making this judgement, we will consider how well you respect your collaborators' and stakeholders' time, whether the length is justified by the quality of the information in the report, and to what extent it contains only the information needed to address the marking criteria.
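As referenced in the Distinction criterion for the evaluation component, here is a self-contained sketch of the kind of diagnostics that discussion could draw on: per-class precision/recall/F1, and a rough overfitting/underfitting check comparing training and test scores. The synthetic dataset and the decision-tree classifier are placeholders, not part of the brief.

```python
# Sketch of evaluation diagnostics: per-class metrics and a train-vs-test comparison.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, f1_score

# Synthetic stand-in for your group's dataset.
X, y = make_classification(n_samples=1000, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)

# Per-class precision, recall and F1 on the test set: large differences between
# classes are worth discussing in the report.
print(classification_report(y_test, model.predict(X_test)))

# A much higher training score than test score suggests overfitting;
# low scores on both suggest underfitting.
print("Train macro F1:", f1_score(y_train, model.predict(X_train), average="macro"))
print("Test macro F1:", f1_score(y_test, model.predict(X_test), average="macro"))
```

In the report, numbers like these would be accompanied by the textual interpretation that the rubric asks for, rather than reported on their own.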

henryliangt commented 2 years ago

08_clustering_and_dimensionality_reduction.pdf