twosigma / beakerx

Beaker Extensions for Jupyter Notebook
http://BeakerX.com
Apache License 2.0

Exception publishing a particular notebook to web #174

Closed jchendy closed 9 years ago

jchendy commented 10 years ago

Load this ipynb: https://raw.githubusercontent.com/gumption/Python_for_Data_Science/master/Python_for_Data_Science_all.ipynb

After it loads, click Notebook -> Publish to Web. The following appears in the terminal:

javax.servlet.ServletException: org.apache.http.client.HttpResponseException: Bad Request
    at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:420)
    at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:538)
    at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:716)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:668)
    at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263)
    at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178)
    at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
    at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62)
    at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
    at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1465)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:499)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:370)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
    at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:960)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1021)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:865)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
    at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:668)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Thread.java:744)
Caused by: org.apache.http.client.HttpResponseException: Bad Request
    at org.apache.http.client.fluent.ContentResponseHandler.handleResponse(ContentResponseHandler.java:47)
    at org.apache.http.client.fluent.ContentResponseHandler.handleResponse(ContentResponseHandler.java:40)
    at org.apache.http.client.fluent.Response.handleResponse(Response.java:85)
    at org.apache.http.client.fluent.Response.returnContent(Response.java:92)
    at com.twosigma.beaker.core.rest.PublishRest.notebook(PublishRest.java:55)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
    at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
    at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
    at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
    at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
    at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
    at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
    at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
    at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1511)
    at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1442)
    at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1391)
    at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1381)
    at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
    ... 28 more

anglee commented 9 years ago

The error was: {"message":"Problems parsing JSON","documentation_url":"https://developer.github.com/v3"}
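[Editor's note: "Problems parsing JSON" is GitHub's generic response when the request body is not valid JSON. One way to narrow this down, independent of Beaker, is to check whether the payload round-trips through a strict JSON parser locally before posting it. The sketch below is hypothetical debugging code, not part of Beaker; the payload shape follows the Gist API body quoted in the next comment, and `notebook` is a stand-in for the real notebook content.]

```python
import json

# Stand-in for the notebook model Beaker serializes (assumed shape).
notebook = {"beaker": "2", "cells": [{"id": "text1", "type": "text"}]}

# The Gist API expects the file content as a *string*, so the notebook
# JSON must be embedded as an escaped string via json.dumps, not spliced
# in as raw JSON.
good_payload = {
    "description": "Beaker Share",
    "public": True,
    "files": {"Beaker Share": {"content": json.dumps(notebook)}},
}
# Round-trips cleanly, so a server-side parser would accept this body.
json.loads(json.dumps(good_payload))
print("good payload parses")

# One common way to earn "Problems parsing JSON": building the body by
# string concatenation, which leaves the inner quotes unescaped.
broken_body = '{"files": {"f": {"content": "' + json.dumps(notebook) + '"}}}'
try:
    json.loads(broken_body)
    print("broken body parses")
except ValueError as e:
    print("parse error:", e)
```

If the exact bytes Beaker sent fail this check, the bug is in how the notebook text is escaped into the request body rather than on GitHub's side.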

anglee commented 9 years ago

The JSON we sent looks ok: {"description":"Beaker Share","public":true,"files":{"Beaker Share":{"content":"{\"beaker\":\"2\",\"evaluators\":[{\"name\":\"Html\",\"plugin\":\"Html\",\"shellID\":null},{\"name\":\"Latex\",\"plugin\":\"Latex\",\"shellID\":null},{\"name\":\"IPython\",\"plugin\":\"IPython\",\"imports\":\"\",\"supplementalClassPath\":\"\",\"shellID\":null}],\"cells\":[{\"id\":\"text7JL7d3\",\"type\":\"text\",\"body\":\"<h1>Python for Data Science<\/h1>\",\"evaluatorReader\":false},{\"id\":\"markdownnBlRin\",\"type\":\"markdown\",\"body\":\"[Joe McCarthy](http:\/\/interrelativity.com\/joe), \\n*Director, Analytics & Data Science*, [Atigeo, LLC](http:\/\/atigeo.com)\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codejlZBph\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"from IPython.display import display, Image, HTML\"},\"output\":{\"result\":\"\"},\"evaluatorReader\":false,\"lineCount\":1},{\"id\":\"textiufX5f\",\"type\":\"text\",\"body\":\"<h2>1. 
Introduction<\/h2>\",\"evaluatorReader\":false},{\"id\":\"markdown02rB6m\",\"type\":\"markdown\",\"body\":\"<a href=\\\"http:\/\/www.python.org\/\\\"><img src=\\\"https:\/\/www.python.org\/static\/community_logos\/python-logo.png\\\" style=\\\"margin: 0px 0px 5px 20px; width: 125px; float: right;\\\" title=\\\"Python\\\" alt=\\\"python-logo-master-v3-TM.png\\\" \/><\/a>\\nThis short primer on [Python](http:\/\/www.python.org\/) is designed to provide a rapid \\\"on-ramp\\\" to enable computer programmers who are already familiar with concepts and constructs in other programming languages learn enough about Python to facilitate the effective use of open-source and proprietary Python-based machine learning and data science tools.\\n\\n<a href=\\\"http:\/\/www.nltk.org\/book\/\\\"><img src=\\\"http:\/\/covers.oreilly.com\/images\/9780596516499\/cat.gif\\\" style=\\\"margin: 0px 0px 5px 20px; width: 125px; float: right;\\\" title=\\\"The Natural Language Toolkit (NLTK) Book\\\" alt=\\\"nltk_book_cover.gif\\\" \/><\/a>\\nThe primer is motivated, in part, by the approach taken in the [Natural Language Toolkit (NLTK) book](http:\/\/www.nltk.org\/book\/), which provides a rapid on-ramp for using Python and the open-source [NLTK library](http:\/\/www.nltk.org\/) to develop programs using natural language processing techniques (many of which involve [machine learning](http:\/\/www.nltk.org\/book\/ch06.html)).\\n\\nThe [Python Tutorial](http:\/\/docs.python.org\/2\/tutorial\/) offers a more comprehensive primer, and opens with an excellent - if biased - overview of some of the general strengths of the Python programming language:\\n\\n> Python is an easy to learn, powerful programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. 
Python\u2019s elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms.\\n\\n<a href=\\\"http:\/\/www.amazon.com\/Python-Scripting-Computational-Science-Engineering\/dp\/3642093159\\\"><img src=\\\"https:\/\/images.springer.com\/sgw\/books\/medium\/9783540739159.jpg\\\" style=\\\"margin: 0px 0px 5px 20px; width: 125px; float: right;\\\" title=\\\"Python Scripting for Computational Science, by Hans Petter Langtangen\\\" alt=\\\"Python Scripting for Computational Science cover\\\" \/><\/a>\\n[Hans Petter Langtangen](http:\/\/folk.uio.no\/hpl\/), author of [Python Scripting for Computational Science](http:\/\/www.amazon.com\/Python-Scripting-Computational-Science-Engineering\/dp\/3642093159), emphasizes the utility of Python for many of the common tasks in all areas of computational science:\\n\\n> Very often programming is about shuffling data in and out of different tools, converting one data format to another, extracting numerical data from a text, and administering numerical experiments involving a large number of data files and directories. 
Such tasks are much faster to accomplish in a language like Python than in Fortran, C, C++, C#, or Java\\n\\n[Foster Provost](http:\/\/people.stern.nyu.edu\/fprovost\/), co-author of [Data Science for Business](http:\/\/data-science-for-biz.com\/), describes why Python is such a useful programming language for practical data science in [Python: A Practical Tool for Data Science](https:\/\/docs.google.com\/document\/pub?id=1p6vowsEuiezLbWnFKgse70a8LxfsrRixqPF5nBg8F3A), :\\n\\n> The practice of data science involves many interrelated but different activities, including accessing data, manipulating data, computing statistics about data, plotting\/graphing\/visualizing data, building predictive and explanatory models from data, evaluating those models on yet more data, integrating models into production systems, etc. One option for the data scientist is to learn several different software packages that each specialize in one or two of these things, but don\u2019t do them all well, plus learn a programming language to tie them together. (Or do a lot of manual work.) \\n> \\n> An alternative is to use a general-purpose, high-level programming language that provides libraries to do all these things. Python is an excellent choice for this. It has a diverse range of open source libraries for just about everything the data scientist will do. It is available everywhere; high performance python interpreters exist for running your code on almost any operating system or architecture. Python and most of its libraries are both open source and free. 
Contrast this with common software packages that are available in a course via an academic license, yet are extremely expensive to license and use in industry.\\n\\n<a href=\\\"http:\/\/scikit-learn.org\/\\\"><img src=\\\"http:\/\/scikit-learn.org\/stable\/_static\/scikit-learn-logo-small.png\\\" style=\\\"margin: 0px 0px 5px 20px; width: 125px; float: right;\\\" title=\\\"scikit-learn: Machine Learning for Python\\\" alt=\\\"scikit-learn-logo-small.png\\\" \/><\/a>\\nThe goal of this primer is to provide efficient and sufficient scaffolding for software engineers with no prior knowledge of Python to be able to effectively use Python-based tools for data science research and development, such as the open-source library [scikit-learn](http:\/\/scikit-learn.org\/). There is another, more comprehensive tutorial for scikit-learn, [Python Scientific Lecture Notes](http:\/\/scipy-lectures.github.io\/index.html), that includes coverage of a number of other useful Python open-source libraries used by scikit-learn ([numpy](http:\/\/www.numpy.org\/), [scipy](http:\/\/www.scipy.org\/) and [matplotlib](http:\/\/matplotlib.org)) - all highly recommended ... and, to keep things simple, all beyond the scope of this primer.\\n\\n<a href=\\\"http:\/\/atigeo.com\/technology\/\\\"><img src=\\\"http:\/\/atigeo.com\/wp-content\/uploads\/2013\/05\/xp-logo-forslider.png\\\" style=\\\"margin: 0px 0px 5px 20px; width: 125px; float: right;\\\" title=\\\"xPatterns, by Atigeo\\\" alt=\\\"xp-logo-forslider.png\\\" \/><\/a>\\nThe initial motivation for this primer was a 2-hour training session for a group of experienced software engineers to learn enough Python to utilize the [Atigeo xPatterns analytics framework API](http:\/\/atigeo.com\/technology\/) in their software development work. 
I am grateful to the company for affording me the opportunity to develop this educational tool, and to make it freely available to others who might be looking for a fast on-ramp to Python for data science.\\n\\nUsing an IPython Notebook as a delivery vehicle for this primer was motivated by Brian Granger's inspiring tutorial, [The IPython Notebook: Get Close to Your Data with Python and JavaScript](http:\/\/strataconf.com\/strata2014\/public\/schedule\/detail\/32033), one of the [highlights from my Strata 2014 conference experience](http:\/\/gumption.typepad.com\/blog\/2014\/02\/ipython-deep-learning-doing-good-some-highlights-from-strata-2014.html).\\n\\nOne final note on external resources: the [Python Style Guide (PEP-0008)](http:\/\/legacy.python.org\/dev\/peps\/pep-0008\/) offers helpful tips on how best to format Python code. [Code like a Pythonista](http:\/\/python.net\/~goodger\/projects\/pycon\/2007\/idiomatic\/handout.html) offers a number of additional tips on Python programming style and philosophy, several of which are incorporated into this primer.\\n\\nWe will focus entirely on using Python within the interpreter environment (as supported within an IPython Notebook). Python scripts - files containing definitions of functions and variables, and typically including code invoking some of those functions - can also be run from a command line. Using Python scripts from the command line may be the subject of a future primer. \\n\\nTo help motivate the data science-oriented Python programming examples provided in this primer, we will start off with a brief overview of basic concepts and terminology in data science.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"textRORbOr\",\"type\":\"text\",\"body\":\"<h2>2. 
Data Science: Basic Concepts<\/h2>\",\"evaluatorReader\":false},{\"id\":\"text9HipGz\",\"type\":\"text\",\"body\":\"<h3>Data Science and Data Mining<\/h3>\",\"evaluatorReader\":false},{\"id\":\"markdownBxk1Rw\",\"type\":\"markdown\",\"body\":\"<a href=\\\"http:\/\/data-science-for-biz.com\/\\\"><img src=\\\"http:\/\/akamaicovers.oreilly.com\/images\/0636920028918\/cat.gif\\\" style=\\\"margin: 0px 0px 5px 20px; width: 125px; float: right;\\\" title=\\\"Data Science for Business, by Provost and Fawcett\\\" alt=\\\"DataScienceForBusiness_cover.jpg\\\" \/><\/a>\\nFoster Provost and [Tom Fawcett](http:\/\/home.comcast.net\/~tom.fawcett\/public_html\/index.html) offer succinct descriptions of data science and data mining in [Data Science for Business](http:\/\/data-science-for-biz.com\/):\\n\\n> **Data science** involves principles, processes and techniques for understanding phenomena via the (automated) analysis of data.\\n> \\n> **Data mining** is the extraction of knowledge from data, via technologies that incorporate these principles.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"textKRZvod\",\"type\":\"text\",\"body\":\"<h3>Knowledge Discovery, Data Mining and Machine Learning<\/h3>\",\"evaluatorReader\":false},{\"id\":\"markdowntm30Zs\",\"type\":\"markdown\",\"body\":\"Provost & Fawcett also offer some history and insights into the relationship between *data mining* and *machine learning*, terms which are often used somewhat interchangeably:\\n\\n> The field of Data Mining (or KDD: Knowledge Discovery and Data Mining) started as an offshoot of Machine Learning, and they remain closely linked. Both fields are concerned with the analysis of data to find useful or informative patterns. Techniques and algorithms are shared between the two; indeed, the areas are so closely related that researchers commonly participate in both communities and transition between them seamlessly. 
Nevertheless, it is worth pointing out some of the differences to give perspective.\\n> \\n>Speaking generally, because Machine Learning is concerned with many types of performance improvement, it includes subfields such as robotics and computer vision that are not part of KDD. It also is concerned with issues of agency and cognition \u2014 how will an intelligent agent use learned knowledge to reason and act in its environment \u2014 which are not concerns of Data Mining.\\n> \\n>Historically, KDD spun off from Machine Learning as a research field focused on concerns raised by examining real-world applications, and a decade and a half later the KDD community remains more concerned with applications than Machine Learning is. As such, research focused on commercial applications and business issues of data analysis tends to gravitate toward the KDD community rather than to Machine Learning. KDD also tends to be more concerned with the entire process of data analytics: data preparation, model learning, evaluation, and so on.\\n\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"textjUkJ3i\",\"type\":\"text\",\"body\":\"<h3>Cross Industry Standard Process for Data Mining (CRISP-DM)<\/h3>\",\"evaluatorReader\":false},{\"id\":\"markdownbPesjX\",\"type\":\"markdown\",\"body\":\"The [Cross Industry Standard Process for Data Mining](https:\/\/en.wikipedia.org\/wiki\/Cross_Industry_Standard_Process_for_Data_Mining) introduced a process model for data mining in 2000 that has become widely adopted.\\n\\n<a href=\\\"https:\/\/en.wikipedia.org\/wiki\/Cross_Industry_Standard_Process_for_Data_Mining\\\"><img src=\\\"https:\/\/upload.wikimedia.org\/wikipedia\/commons\/thumb\/b\/b9\/CRISP-DM_Process_Diagram.png\/479px-CRISP-DM_Process_Diagram.png\\\" title=\\\"Cross Industry Standard Process for Data Mining\\\" alt=\\\"CRISP-DM_Process_Diagram\\\" \/><\/a>\\n\\nThe model emphasizes the ***iterative*** nature of the data mining process, distinguishing several different 
stages that are regularly revisited in the course of developing and deploying data-driven solutions to business problems:\\n\\n* Business understanding\\n* Data understanding\\n* Data preparation\\n* Modeling \\n* Deployment\\n\\nWe will be focusing primarily on using Python for **data preparation** and **modeling**.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"textw8ZLi9\",\"type\":\"text\",\"body\":\"<h3>Data Science Workflow<\/h3>\",\"evaluatorReader\":false},{\"id\":\"markdownZN4Gk1\",\"type\":\"markdown\",\"body\":\"[Philip Guo](http:\/\/www.pgbovine.net\/) presents a [Data Science Workflow](http:\/\/cacm.acm.org\/blogs\/blog-cacm\/169199-data-science-workflow-overview-and-challenges\/fulltext) offering a slightly different process model emhasizing the importance of **reflection** and some of the meta-data, data management and bookkeeping challenges that typically arise in the data science process. His 2012 PhD thesis, [Software Tools to Facilitate Research Programming](http:\/\/pgbovine.net\/projects\/pubs\/guo_phd_dissertation.pdf), offers an insightful and more comprehensive description of many of these challenges.\\n\\n<a href=\\\"http:\/\/cacm.acm.org\/blogs\/blog-cacm\/169199-data-science-workflow-overview-and-challenges\/fulltext\\\"><img src=\\\"http:\/\/cacm.acm.org\/system\/assets\/0001\/3678\/rp-overview.jpg\\\" title=\\\"Data Science Workflow, by Philip Guo\\\" alt=\\\"pguo-data-science-overview.jpg\\\" style=\\\"width: 500px\\\" \/><\/a>\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"markdowneMgvkG\",\"type\":\"markdown\",\"body\":\"Provost & Fawcett list a number of different tasks in which data science techniques are employed:\\n\\n* Classification and class probability estimation \\n* Regression (aka value estimation) \\n* Similarity matching \\n* Clustering \\n* Co-occurrence grouping (aka frequent itemset mining, association rule discovery, market-basket analysis) \\n* Profiling (aka behavior description, fraud \/ 
anomaly detection) \\n* Link prediction \\n* Data reduction \\n* Causal modeling \\n\\nWe will be focusing primarily on **classification** and **class probability estimation** tasks, which are defined by Provost & Fawcett as follows:\\n\\n> *Classification* and *class probability estimation* attempt to predict, for each individual in a population, which of a (small) set of classes this individual belongs to. Usually the classes are mutually exclusive. An example classification question would be: \u201CAmong all the customers of MegaTelCo, which are likely to respond to a given offer?\u201D In this example the two classes could be called will respond and will not respond.\\n\\nTo further simplify this primer, we will focus exclusively on **supervised** methods, in which the data is explicitly labeled with classes. There are also *unsupervised* methods that involve working with data in which there are no pre-specified class labels.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"textyLqeVi\",\"type\":\"text\",\"body\":\"<h3>Supervised Classification<\/h3>\",\"evaluatorReader\":false},{\"id\":\"markdownWoyYo3\",\"type\":\"markdown\",\"body\":\"The [Natural Language Toolkit (NLTK) book](http:\/\/www.nltk.org\/book) provides a diagram and succinct description (below, with italics and bold added for emphasis) of supervised classification:\\n\\n<a href=\\\"http:\/\/www.nltk.org\/book\/ch06.html\\\"><img src=\\\"http:\/\/www.nltk.org\/images\/supervised-classification.png\\\" title=\\\"Supervised Classification, from NLTK book, Chapter 6\\\" alt=\\\"nltk_ch06_supervised-classification.png\\\" style=\\\"width: 500px\\\" \/><\/a>\\n\\n> *Supervised Classification*. (a) During *training*, a **feature extractor** is used to convert each **input value** to a **feature set**. These feature sets, which capture the basic information about each input that should be used to classify it, are discussed in the next section. 
Pairs of feature sets and **labels** are fed into the **machine learning algorithm** to generate a **model**. (b) During *prediction*, the same feature extractor is used to convert **unseen inputs** to feature sets. These feature sets are then fed into the model, which generates **predicted labels**.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"textnBPhsc\",\"type\":\"text\",\"body\":\"<h3>Data Mining Terminology<\/h3>\",\"evaluatorReader\":false},{\"id\":\"markdownmJy4UO\",\"type\":\"markdown\",\"body\":\"* **Structured** data has simple, well-defined patterns (e.g., a table or graph)\\n* **Unstructured** data has less well-defined patterns (e.g., text, images)\\n* **Model**: a pattern that captures \/ generalizes regularities in data (e.g., an equation, set of rules, decision tree)\\n* **Attribute** (aka *variable*, *feature*, *signal*, *column*): an element used in a model\\n* **Example** (aka *instance*, *feature vector*, *row*): a representation of an entity being modeled\\n* **Target attribute** (aka *dependent variable*, *class label*): the class \/ type \/ category of an entity being modeled\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"textCoJWcF\",\"type\":\"text\",\"body\":\"<h3>Data Mining Example: UCI Mushroom dataset<\/h3>\",\"evaluatorReader\":false},{\"id\":\"markdownX4PQzT\",\"type\":\"markdown\",\"body\":\"The [Center for Machine Learning and Intelligent Systems](http:\/\/cml.ics.uci.edu\/) at the University of California, Irvine (UCI), hosts a [Machine Learning Repository](https:\/\/archive.ics.uci.edu\/ml\/datasets.html) containing over 200 publicly available data sets.\\n\\n<a href=\\\"https:\/\/archive.ics.uci.edu\/ml\/datasets\/Mushroom\\\"><img src=\\\"https:\/\/archive.ics.uci.edu\/ml\/assets\/MLimages\/Large73.jpg\\\" style=\\\"margin: 0px 0px 5px 20px; width: 125px; float: right;\\\" title=\\\"Mushrooms from Agaricus and Lepiota Family\\\" alt=\\\"mushroom\\\"\/><\/a>\\nWe will use the 
[mushroom](https:\/\/archive.ics.uci.edu\/ml\/datasets\/Mushroom) data set, which forms the basis of several examples in Chapter 3 of the Provost & Fawcett data science book.\\n\\nThe following description of the dataset is provided at the UCI repository:\\n\\n>This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525 [The Audubon Society Field Guide to North American Mushrooms, 1981]). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like leaflets three, let it be'' for Poisonous Oak and Ivy.\\n> \\n> **Number of Instances**: 8124\\n> \\n> **Number of Attributes**: 22 (all nominally valued)\\n> \\n> **Attribute Information**: (*classes*: edible=e, poisonous=p)\\n> \\n> 1. *cap-shape*: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s\\n> 2. *cap-surface*: fibrous=f, grooves=g, scaly=y, smooth=s\\n> 3. *cap-color*: brown=n ,buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y\\n> 4. *bruises?*: bruises=t, no=f\\n> 5. *odor*: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s\\n> 6. *gill-attachment*: attached=a, descending=d, free=f, notched=n\\n> 7. *gill-spacing*: close=c, crowded=w, distant=d\\n> 8. *gill-size*: broad=b, narrow=n\\n> 9. *gill-color*: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y\\n> 10. *stalk-shape*: enlarging=e, tapering=t\\n> 11. *stalk-root*: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?\\n> 12. *stalk-surface-above-ring*: fibrous=f, scaly=y, silky=k, smooth=s\\n> 13. *stalk-surface-below-ring*: fibrous=f, scaly=y, silky=k, smooth=s\\n> 14. 
*stalk-color-above-ring*: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y\\n> 15. *stalk-color-below-ring*: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y\\n> 16. *veil-type*: partial=p, universal=u\\n> 17. *veil-color*: brown=n, orange=o, white=w, yellow=y\\n> 18. *ring-number*: none=n, one=o, two=t\\n> 19. *ring-type*: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z\\n> 20. *spore-print-color*: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y\\n> 21. *population*: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y\\n> 22. *habitat*: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d\\n> \\n> **Missing Attribute Values**: 2480 of them (denoted by \\\"?\\\"), all for attribute #11.\\n> \\n> **Class Distribution**: -- edible: 4208 (51.8%) -- poisonous: 3916 (48.2%) -- total: 8124 instances\\n\\nThe [data file](https:\/\/archive.ics.uci.edu\/ml\/machine-learning-databases\/mushroom\/agaricus-lepiota.data) associated with this dataset has one instance of a hypothetical mushroom per line, with abbreviations for the values of the class and each of the other 22 attributes separated by commas.\\n\\nHere is a sample line from the data file:\\n\\np,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\\n\\nThis instance represents a mushroom with the following attribute values:\\n\\n*class*: edible=e, **poisonous=p**\\n\\n1. *cap-shape*: bell=b, conical=c, convex=x, flat=f, **knobbed=k**, sunken=s\\n2. *cap-surface*: **fibrous=f**, grooves=g, scaly=y, smooth=s\\n3. *cap-color*: **brown=n** ,buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y\\n4. *bruises?*: bruises=t, **no=f**\\n5. *odor*: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, **none=n**, pungent=p, spicy=s\\n6. *gill-attachment*: attached=a, descending=d, **free=f**, notched=n\\n7. 
*gill-spacing*: **close=c**, crowded=w, distant=d\\n8. *gill-size*: broad=b, **narrow=n**\\n9. *gill-color*: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, **white=w**, yellow=y\\n10. *stalk-shape*: **enlarging=e**, tapering=t\\n11. *stalk-root*: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, **missing=?**\\n12. *stalk-surface-above-ring*: fibrous=f, scaly=y, **silky=k**, smooth=s\\n13. *stalk-surface-below-ring*: fibrous=f, **scaly=y**, silky=k, smooth=s\\n14. *stalk-color-above-ring*: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, **white=w**, yellow=y\\n15. *stalk-color-below-ring*: **brown=n**, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y\\n16. *veil-type*: **partial=p**, universal=u\\n17. *veil-color*: brown=n, orange=o, **white=w**, yellow=y\\n18. *ring-number*: none=n, **one=o**, two=t\\n19. *ring-type*: cobwebby=c, **evanescent=e**, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z\\n20. *spore-print-color*: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, **white=w**, yellow=y\\n21. *population*: abundant=a, clustered=c, numerous=n, scattered=s, **several=v**, solitary=y\\n22. *habitat*: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, **woods=d**\\n\\nBuilding a model with this data set will serve as a motivating example throughout much of this primer.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"textGPJ4iE\",\"type\":\"text\",\"body\":\"<h2>3. Python: Basic Concepts<\/h2>\",\"evaluatorReader\":false},{\"id\":\"textfSaX0W\",\"type\":\"text\",\"body\":\"<h3>Identifiers, strings, lists and tuples<\/h3>\",\"evaluatorReader\":false},{\"id\":\"markdownukDC6t\",\"type\":\"markdown\",\"body\":\"The sample instance shown above can be represented as a string. 
A Python *string* ([str](http:\/\/docs.python.org\/2\/tutorial\/introduction.html#strings)) is a sequence of 0 or more characters enclosed within a pair of single quotes (') or a pair double quotes (\"). \",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeSp56HJ\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"'p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d'\"},\"output\":{\"selectedType\":\"Text\",\"result\":\"'p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d'\"},\"evaluatorReader\":false,\"lineCount\":1},{\"id\":\"markdownET1Iai\",\"type\":\"markdown\",\"body\":\"Python [*identifiers*](http:\/\/docs.python.org\/2\/reference\/lexical_analysis.html#identifiers) (or [*names*](https:\/\/docs.python.org\/2\/reference\/executionmodel.html#naming-and-binding)) are composed of letters, numbers and\/or underscores ('_'), starting with a letter or underscore. Python identifiers are case sensitive. Although camelCase identifiers can be used, it is generally considered more [pythonic](http:\/\/python.net\/~goodger\/projects\/pycon\/2007\/idiomatic\/handout.html) to use underscores. 
Python variables and functions typically start with lowercase letters; Python classes start with uppercase letters.\\n\\nThe following [assignment statement](http:\/\/docs.python.org\/2\/reference\/simple_stmts.html#assignment-statements) binds the value of the string shown above to the namesingle_instance_str.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codekEwJ7D\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"single_instance_str = 'p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d'\"},\"output\":{\"result\":\"\"},\"evaluatorReader\":false,\"lineCount\":1},{\"id\":\"markdownc3sFMP\",\"type\":\"markdown\",\"body\":\"The [**print**](http:\/\/docs.python.org\/2\/tutorial\/inputoutput.html) statement writes the value of its comma-delimited arguments to [sys.stdout](http:\/\/docs.python.org\/2\/library\/sys.html#sys.stdout) (typically the console). Each value in the output is separated by a single blank space. If the last argument is followed by a comma, the output cursor will stay on the same line.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"code84pxd3\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"print 'Instance 1:', single_instance_str\\nprint 'A', 'B', # note comma at the end\\nprint 'C' # will appear on same line\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":3},{\"id\":\"markdownR79ybQ\",\"type\":\"markdown\",\"body\":\"The Python comment character is'#': anything after'#'on the line is ignored by the Python interpreter. 
\\n\\nPairs of triple quotes (''' or \"\"\") can be used to delimit multi-line comments.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codehrGyku\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"'''\\nA multi-line\\ncomment\\n'''\\nprint 'no comment'\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":5},{\"id\":\"markdown6YIOof\",\"type\":\"markdown\",\"body\":\"A [list](http:\/\/docs.python.org\/2\/tutorial\/introduction.html#lists) is an ordered sequence of 0 or more comma-delimited elements enclosed within square brackets ('[', ']'). The Python [str.split(sep)](http:\/\/docs.python.org\/2\/library\/stdtypes.html#str.split) method can be used to split a sep-delimited string into a corresponding list of elements.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeyjd5ul\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"single_instance_list = single_instance_str.split(',')\\nprint single_instance_list\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":2},{\"id\":\"markdownNdgpta\",\"type\":\"markdown\",\"body\":\"Python sequences are *heterogeneous*, i.e., they can contain elements of different types.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeozJEAc\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"mixed_list = ['a', 1, 2.3, True, [1, 'b']]\\nprint mixed_list\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":2},{\"id\":\"markdownyxdfes\",\"type\":\"markdown\",\"body\":\"The Python **+** operator can be used to concatenate lists.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeCkOYjs\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"concatenated_list = ['a', 1] + [2.3, True] + [[1, 'b']]\\nprint concatenated_list\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":2},{\"id\":\"markdownVMqbJ8\",\"type\":\"markdown\",\"body\":\"Individual elements of 
[*sequences*](http:\/\/docs.python.org\/2\/library\/stdtypes.html#typesseq) (lists, strings and other data structures) can be accessed by specifying their zero-based index position within square brackets ('[', ']').\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"code87ooHl\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"print single_instance_str[2], single_instance_list[2]\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":1},{\"id\":\"markdowntkqkQo\",\"type\":\"markdown\",\"body\":\"Negative index values can be used to specify a position offset from the end of the sequence. It is often useful to use a -1 index value to access the last element of a sequence.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeMWwe3Y\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"print single_instance_str[-1], single_instance_list[-1]\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":1},{\"id\":\"markdownaoFlR9\",\"type\":\"markdown\",\"body\":\"The Python *slice notation* can be used to access subsequences by specifying two index positions separated by a colon (':'); seq[start:stop] returns all the elements in seq between start and stop - 1 (inclusive).\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeaQXiXx\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"print single_instance_str[2:4]\\nprint single_instance_list[2:4]\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":2},{\"id\":\"markdownEtj9jF\",\"type\":\"markdown\",\"body\":\"Slice indices can be negative values.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codelvJYOv\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"print single_instance_str[-4:-2]\\nprint single_instance_list[-4:-2]\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":2},{\"id\":\"markdownL6GiPr\",\"type\":\"markdown\",\"body\":\"The start and\/or stop index can be omitted. 
A common use of slices with a single index value is to access all but the first element or all but the last element of a sequence.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codec6B3MX\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"print single_instance_str[:-1] # all but the last\\nprint single_instance_list[:-1]\\nprint single_instance_str[1:] # all but the first\\nprint single_instance_list[1:]\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":4},{\"id\":\"markdownIT52F4\",\"type\":\"markdown\",\"body\":\"Slice notation includes an optional third element, step, as in seq[start:stop:step], that specifies the steps or increments by which elements are retrieved from seq between start and stop - 1:\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeprsGO2\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"print single_instance_str\\nprint single_instance_str[::2] # print elements in even-numbered positions (the values, in this case)\\nprint single_instance_str[1::2] # print elements in odd-numbered positions (the commas, in this case)\\nprint single_instance_str[::-1] # reverse the string\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":4},{\"id\":\"markdownPQ8CjI\",\"type\":\"markdown\",\"body\":\"The [Python tutorial](http:\/\/docs.python.org\/2\/tutorial\/introduction.html) offers a helpful ASCII art representation to show how positive and negative indexes are interpreted:\\n\\n<pre>\\n +---+---+---+---+---+\\n | H | e | l | p | A |\\n +---+---+---+---+---+\\n 0 1 2 3 4 5\\n-5 -4 -3 -2 -1\\n<\/pre>\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"markdown2E9H8m\",\"type\":\"markdown\",\"body\":\"Python statements are typically separated by newlines (rather than, say, the semi-colon in Java). Statements can extend over more than one line; it is generally best to break the lines after commas within parentheses, braces or brackets. 
Inserting a backslash character ('\\\\\\\\') at the end of a line will also enable continuation of the statement on the next line, but it is generally best to look for other alternatives.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"code8oICRb\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"attribute_names = ['class', \\n 'cap-shape', 'cap-surface', 'cap-color', \\n 'bruises?', \\n 'odor', \\n 'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', \\n 'stalk-shape', 'stalk-root', \\n 'stalk-surface-above-ring', 'stalk-surface-below-ring', \\n 'stalk-color-above-ring', 'stalk-color-below-ring',\\n 'veil-type', 'veil-color', \\n 'ring-number', 'ring-type', \\n 'spore-print-color', \\n 'population', \\n 'habitat']\\nprint attribute_names\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":14},{\"id\":\"markdownHc0R9i\",\"type\":\"markdown\",\"body\":\"The [str.strip(\[chars\])](http://docs.python.org/2/library/stdtypes.html#str.strip) method returns a copy of str in which any leading or trailing chars are removed. If no chars are specified, it removes all leading and trailing whitespace. [_Whitespace_ is any sequence of spaces, tabs ('\\t') and\/or newline ('\\n') characters.] \",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codep1kOsZ\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"print '_', '\\tp,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\\n', '_'\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":1},{\"id\":\"codeK0rrFI\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"print '_', '\\tp,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\\n'.strip(), '_'\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":1},{\"id\":\"markdownUQCBru\",\"type\":\"markdown\",\"body\":\"A common programming pattern when dealing with CSV (comma-separated values) data files is to repeatedly\n\n1. read a line from a file\n2. strip off any leading and trailing whitespace\n3. 
split the values separated by commas into a list\n\nWe will get to repetition control structures (loops) and file input and output shortly, but here is an example of how str.strip() and str.split() can be chained together in a single instruction:\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"code2gF9mY\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"single_instance_str = '\\tp,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d\\n'\nsingle_instance_list = single_instance_str.strip().split(',') # first strip leading & trailing whitespace, then split on commas\nprint single_instance_list\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":3},{\"id\":\"markdownh7AgTb\",\"type\":\"markdown\",\"body\":\"The [str.join(words)](http://docs.python.org/2/library/string.html#string.join) method is the inverse of str.split(), returning a single string in which each string in the sequence of words is separated by str.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"code3q754c\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"print '_', ','.join(single_instance_list), '_'\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":1},{\"id\":\"markdown1X1Re5\",\"type\":\"markdown\",\"body\":\"A number of Python methods can be used on strings, lists and other sequences.\n\nThe [len(s)](http://docs.python.org/2/library/functions.html#len) function can be used to find the length of (number of items in) a sequence s. It will also return the number of items in a _dictionary_, a data structure we will cover further below.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeWjDjhb\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"print len(single_instance_str), len(single_instance_list)\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":1},{\"id\":\"markdown4YcclW\",\"type\":\"markdown\",\"body\":\"The **in** operator can be used to determine whether a sequence contains a value. 
\n\nBoolean values in Python are **True** and **False** (note the capitalization).\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeolKIui\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"print ',' in single_instance_str, ',' in single_instance_list\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":1},{\"id\":\"markdownHqVMID\",\"type\":\"markdown\",\"body\":\"The [s.count(x)](http://docs.python.org/2/library/stdtypes.html#str.count) method can be used to count the number of occurrences of item x in sequence s.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeMAjMVJ\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"print single_instance_str.count(','), single_instance_list.count('f')\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":1},{\"id\":\"markdownt0i7Gr\",\"type\":\"markdown\",\"body\":\"The [s.index(x)](http://docs.python.org/2/library/stdtypes.html#str.index) method can be used to find the first 0-based index of item x in sequence s. \",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"coderrjSZs\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"print single_instance_str.index(','), single_instance_list.index('f')\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":1},{\"id\":\"markdownBDWA4g\",\"type\":\"markdown\",\"body\":\"One important distinction between strings and lists has to do with their [_mutability_](http://docs.python.org/2/reference/datamodel.html).\n\nPython strings are _immutable_, i.e., they cannot be modified. Most string methods (like str.strip()) return modified _copies_ of the strings on which they are used.\n\nPython lists are _mutable_, i.e., they can be modified. 
\n\nThe examples below illustrate a number of [list](http://docs.python.org/2/tutorial/datastructures.html#more-on-lists) methods that modify lists.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codezl4Qpu\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"list_1 = [4, 2, 3, 5, 1]\nlist_2 = list_1 # list_2 now references the same object as list_1\nprint 'list_1: ', list_1\nprint 'list_2: ', list_2\nlist_1.remove(1)\nprint 'list_1.remove(1):', list_1\nlist_1.append(6)\nprint 'list_1.append(6):', list_1\nlist_1.sort()\nprint 'list_1.sort(): ', list_1\nlist_1.reverse()\nprint 'list_1.reverse():', list_1\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":12},{\"id\":\"markdownJOlpd7\",\"type\":\"markdown\",\"body\":\"When more than one name (e.g., a variable) is bound to the same mutable object, changes made to that object are reflected in all names bound to that object. For example, in the second statement above, list_2 is bound to the same object that is bound to list_1, namely, the list [4, 2, 3, 5, 1]. All changes made to the object bound to list_1 will thus be reflected in list_2 (since they both reference the same object).\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codektLDiB\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"print 'list_1: ', list_1\nprint 'list_2: ', list_2\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":2},{\"id\":\"markdownZjrF4z\",\"type\":\"markdown\",\"body\":\"There are sorting and reversing functions, [sorted()](https://docs.python.org/2.7/library/functions.html#sorted) and [reversed()](https://docs.python.org/2.7/library/functions.html#reversed), that do not modify their arguments, and can thus be used on mutable or immutable objects. 
We will elaborate on each of these functions further below, but here are a couple of examples of how sorted() returns a sorted list of each element in its argument.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeIMxq8D\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"print 'sorted(list_1):', sorted(list_1) # return a copy of list_1 in sorted order\nprint 'list_1: ', list_1\nprint 'sorted(single_instance_str):', sorted(single_instance_str) # returns a list of sorted elements in the string\nprint 'single_instance_str: ', single_instance_str\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":4},{\"id\":\"markdowne2EVtK\",\"type\":\"markdown\",\"body\":\"A [_tuple_](http://docs.python.org/2/tutorial/datastructures.html#tuples-and-sequences) is an ordered, immutable sequence of 0 or more comma-delimited values enclosed in parentheses ('(',')'). Many of the functions that operate on strings and lists also operate on tuples.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeWVNcjk\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"x = (1, 2, 3, 4, 5) # a tuple\nprint 'x =', x, ', len(x) =', len(x), ', x.index(3) =', x.index(3), ', x[4:2:-1] = ', x[4:2:-1]\nprint 'sorted(x, reverse=True):', sorted(x, reverse=True) # sorted always returns a list; reverse=True specifies reverse sort order\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":3},{\"id\":\"markdownNMmy2Y\",\"type\":\"markdown\",\"body\":\"If the s.index(x) or list.remove(x) method is used on a sequence s or list that does not contain the value x, a [ValueError](http://docs.python.org/2/library/exceptions.html#exceptions.ValueError) exception is raised.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codenPMLnH\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"print x.index(6) # a ValueError will be 
raised\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":1},{\"id\":\"textEGIgOd\",\"type\":\"text\",\"body\":\"<h3>Conditionals<\/h3>\",\"evaluatorReader\":false},{\"id\":\"markdownnAhToo\",\"type\":\"markdown\",\"body\":\"One common approach to handling errors is to _look before you leap (LBYL)_, i.e., test for potential [exceptions](http://docs.python.org/2/tutorial/errors.html) before executing instructions that might raise those exceptions. \n\nThis approach can be implemented using the [**if**](http://docs.python.org/2/tutorial/controlflow.html#if-statements) statement (which may optionally include an **else** and any number of **elif** clauses).\n\nThe following is a simple example of anifstatement:\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeLMBWif\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"class_value = 'e' # try changing this to 'p' or 'x'\n\nif class_value == 'e':\n print 'edible'\nelif class_value == 'p':\n print 'poisonous'\nelse:\n print 'unknown'\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":8},{\"id\":\"markdown6IG1sk\",\"type\":\"markdown\",\"body\":\"Note that \n\n\* a colon (':') is used at the end of the lines withif,elseorelif\n\* no parentheses are required to enclose the boolean condition (it is presumed to include everything betweeniforelifand the colon)\n\* the statements below eachif,elifandelseline are all indented\n\nPython does not have special characters to delimit statement blocks (like the '{' and '}' delimiters in Java); instead, sequences of statements with the same _indentation level_ are treated as a statement block. 
The [Python Style Guide](http://legacy.python.org/dev/peps/pep-0008/) recommends using 4 spaces for each indentation level.\n\nAn if statement can be used to follow the LBYL paradigm in preventing the ValueError that occurred in an earlier example:\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeMbZP2Q\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"attribute = 'bruises' # try substituting 'bruises?' for 'bruises' and re-running this code\n\nif attribute in attribute_names:\n i = attribute_names.index(attribute)\n print attribute, 'is in position', i\nelse:\n print attribute, 'is not in', attribute_names\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":7},{\"id\":\"markdownwxanpn\",\"type\":\"markdown\",\"body\":\"Another perspective on handling errors championed by some pythonistas is that it is [_easier to ask forgiveness than permission (EAFP)_](http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html#eafp-vs-lbyl).\n\nAs in many practical applications of philosophy, religion or dogma, it is helpful to _think before you choose (TBYC)_. There are a number of factors to consider in deciding whether to follow the EAFP or LBYL paradigm, including code readability and the anticipated likelihood and relative severity of encountering an exception. Oran Looney wrote a blog post providing a nice overview of the debate over [LBYL vs. EAFP](http://oranlooney.com/lbyl-vs-eafp/).\n\nWe will follow the LBYL paradigm throughout most of this primer. However, as an illustration of EAFP in Python, here is an alternate implementation of the functionality of the code above, using a [try\/except](http://docs.python.org/2/tutorial/errors.html#handling-exceptions) statement.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codePUZtsY\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"attribute = 'bruises' # try substituting 'bruises?' 
for 'bruises' and re-running this code\n\ntry:\n i = attribute_names.index(attribute)\n print attribute, 'is in position', i\nexcept ValueError:\n print attribute, 'is not found'\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":7},{\"id\":\"markdownhmMWLj\",\"type\":\"markdown\",\"body\":\"The Python _null object_ is **None** (note the capitalization).\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeKAOh27\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"attribute = 'bruises?'\n\nif attribute not in attribute_names: # equivalent to 'not attribute in attribute_names'\n value = None\nelse:\n i = attribute_names.index(attribute)\n value = single_instance_list[i]\n \nprint attribute, '=', value\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":9},{\"id\":\"textBTVzKz\",\"type\":\"text\",\"body\":\"<h3>Defining and calling functions<\/h3>\",\"evaluatorReader\":false},{\"id\":\"markdowneIP3lV\",\"type\":\"markdown\",\"body\":\"Python [_function definitions_](http://docs.python.org/2/tutorial/controlflow.html#defining-functions) start with the **def** keyword followed by a function name, a list of 0 or more comma-delimited _parameters_ (aka 'formal parameters') enclosed within parentheses, and then a colon (':'). \n\nA function definition may include one or more [**return**](http://docs.python.org/2/reference/simple_stmts.html#the-return-statement) statements to indicate the value(s) returned to where the function is called. 
It is good practice to include a short [docstring](http://docs.python.org/2/tutorial/controlflow.html#tut-docstrings) to briefly describe the behavior of the function and the value(s) it returns.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeIHM37I\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"def attribute_value(instance, attribute, attribute_names):\n '''Returns the value of attribute in instance, based on the position of attribute in the list of attribute_names'''\n if attribute not in attribute_names:\n return None\n else:\n i = attribute_names.index(attribute)\n return instance[i] # using the parameter name here\"},\"output\":{\"result\":\"\"},\"evaluatorReader\":false,\"lineCount\":7},{\"id\":\"markdown4IkQ5h\",\"type\":\"markdown\",\"body\":\"A _function call_ starts with the function name, followed by a list of 0 or more comma-delimited _arguments_ (aka 'actual parameters') enclosed within parentheses. A function call can be used as a statement or within an expression.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"code9EfGRd\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"attribute = 'cap-shape' # try substituting any of the other attribute names shown above\nprint attribute, '=', attribute_value(single_instance_list, attribute, attribute_names)\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":2},{\"id\":\"markdownje2dq1\",\"type\":\"markdown\",\"body\":\"Note that Python does not distinguish between names used for _variables_ and names used for _functions_. An assignment statement binds a value to a name; a function definition also binds a value to a name. At any given time, the value most recently bound to a name is the one that is used. 
\n\nThe [type(object)](http://docs.python.org/2.7/library/functions.html#type) function returns the type of object.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"code6fOOWQ\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"x = 0\nprint 'x used as a variable:', x, type(x)\ndef x():\n print 'x'\nprint 'x used as a function:', x, type(x)\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":5},{\"id\":\"markdownsvCEnq\",\"type\":\"markdown\",\"body\":\"Also note that Python function arguments are passed using _call by object reference_. Thus any modifications made to a parameter that has been passed a mutable object bound to a name as an argument will persist after the function exits.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeSGeyqe\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"def insert_x(list_parameter):\n '''Inserts \\"x\\" at the head of a list, modifying the list argument'''\n list_parameter.insert(0, 'x')\n print 'Inserted x:', list_parameter\n return list_parameter\n\ninsert_x([1, 2, 3]) # passing an unnamed object does not affect any existing names\nlist_argument = [1, 2, 3] # passing a named object will affect the object bound to that name\nprint 'Before:', list_argument\ninsert_x(list_argument)\nprint 'After:', list_argument\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":11},{\"id\":\"markdownNUvP2Y\",\"type\":\"markdown\",\"body\":\"One way of preventing functions from modifying mutable objects passed as parameters is to make a copy of those objects inside the function. Here is another version of the function above that makes a shallow copy of the _list_parameter_ using the slice operator. 
\n\n_\\[Note: the Python [copy](http://docs.python.org/2/library/copy.html) module provides both [shallow] [copy()](http://docs.python.org/2/library/copy.html#copy.copy) and [deepcopy()](http://docs.python.org/2/library/copy.html#copy.deepcopy) methods; we will cover modules further below.]_\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"code2D913p\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"def insert_x_copy(list_parameter):\n '''Inserts \\"x\\" at the head of a list, without modifying the list argument'''\n list_parameter_copy = list_parameter[:]\n list_parameter_copy.insert(0, 'x')\n print 'Inserted x:', list_parameter_copy\n return list_parameter_copy\n\ninsert_x_copy([1, 2, 3]) # passing an unnamed object does not affect any existing names\nlist_argument = [1, 2, 3] # passing a named object will affect the object bound to that name\nprint 'Before:', list_argument\ninsert_x_copy(list_argument)\nprint 'After:', list_argument\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":12},{\"id\":\"markdownxjkABE\",\"type\":\"markdown\",\"body\":\"Python functions can return more than one value, by separating those return values with commas in the **return** statement. Multiple values are returned as a tuple. If the function-invoking expression is an assignment statement, multiple variables can be assigned the multiple values returned by the function in a single statement. 
This combining of values and subsequent separation is known as tuple _packing_ and _unpacking_.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeb0GYzw\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"def min_and_max(list_of_values):\n '''Returns a tuple containing the min and max values in the list_of_values'''\n return min(list_of_values), max(list_of_values)\n\nlist_1 = [3, 1, 4, 2, 5]\nprint 'min and max of', list_1, ':', min_and_max(list_1)\n\nmin_and_max_list_1 = min_and_max(list_1) # a single variable is assigned the two-element tuple\nprint 'min and max of', list_1, ':', min_and_max_list_1\n\nmin_list_1, max_list_1 = min_and_max(list_1) # the 1st variable is assigned the 1st value, the 2nd variable is assigned the 2nd value\nprint 'min and max of', list_1, ':', min_list_1, ',', max_list_1\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":12},{\"id\":\"textexH9cy\",\"type\":\"text\",\"body\":\"<h3>Iteration: for, range<\/h3>\",\"evaluatorReader\":false},{\"id\":\"markdownLGffAR\",\"type\":\"markdown\",\"body\":\"The [**for**](http://docs.python.org/2/tutorial/controlflow.html#for-statements) statement iterates over the elements of a sequence.\n\nThe [range(stop)](http://docs.python.org/2/tutorial/controlflow.html#the-range-function) function returns a list of values from 0 up to stop - 1 (inclusive). 
\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codewKnewi\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"print 'Index values for attributes:', range(len(attribute_names)), '\\n'\n\nprint 'Values for the', len(attribute_names), 'attributes:\\n'\nfor i in range(len(attribute_names)):\n print attribute_names[i], '=', attribute_value(single_instance_list, attribute_names[i], attribute_names)\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":5},{\"id\":\"markdownNmYtOR\",\"type\":\"markdown\",\"body\":\"The more general form of the function, [range(start, stop[, step])](http://docs.python.org/2/library/functions.html#range), returns a list of values fromstarttostop - 1(inclusive) increasing bystep(which defaults to1), or fromstartdown tostop + 1(inclusive) decreasing bystepifstepis negative.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codedkAMWC\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"print 'range(5, 10):', range(5, 10)\nprint 'range(10, 5, -1):', range(10, 5, -1)\nprint 'range(0, 10, 2):', range(0, 10, 2)\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":3},{\"id\":\"markdownoVx5Bp\",\"type\":\"markdown\",\"body\":\"The [xrange(stop[, stop[, step]])](http://docs.python.org/2/library/functions.html#xrange) function is an [_iterable_](http://docs.python.org/2/glossary.html#term-iterable) version of therange()function. In the context of aforloop, it returns the _next_ item of the sequence for each iteration of the loop rather than creating _all_ the elements of the sequence before the first iteration. This can reduce memory consumption in cases where iteration over all the items is not required. \n\nTherange()function returns a list, which can then be manipulated by any list or sequence methods. An _xrange object_ can only be used in aforloop or thelen()function. 
A related and slightly more general class of container objects, [_iterators_](http://docs.python.org/2/library/stdtypes.html#typeiter), includes a [next()](http://docs.python.org/2/library/stdtypes.html#iterator.next) method for explicitly returning the next item in the container.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"coden2HVyN\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"print xrange(len(attribute_names)), '\\n'\n\nprint 'Values for the', len(attribute_names), 'attributes:\\n'\nfor i in xrange(len(attribute_names)):\n print attribute_names[i], '=', attribute_value(single_instance_list, attribute_names[i], attribute_names)\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":5},{\"id\":\"textTNsscg\",\"type\":\"text\",\"body\":\"<h3>Modules, namespaces and dotted notation<\/h3>\",\"evaluatorReader\":false},{\"id\":\"markdownwUJHiY\",\"type\":\"markdown\",\"body\":\"A Python [**_module_**](http://docs.python.org/2/tutorial/modules.html) is a file containing related definitions (e.g., of functions and variables). Modules are used to help organize the Python [**_namespaces_**](http://docs.python.org/2/tutorial/classes.html#python-scopes-and-namespaces), the set of identifiers accessible in a particular context. All of the functions and variables we define in this IPython Notebook are in the main namespace, so accessing them does not require any specification of a module.\\n\\nA Python module named simple_ml (in the file simple_ml.py) contains a set of solutions to the exercises in this IPython Notebook. 
Accessing functions in that module requires that we first [import](http:\/\/docs.python.org\/2\/reference\/simple_stmts.html#the-import-statement) the module, and then prefix the function names with the module name followed by a dot (this is known as ***dotted notation***).\\n\\nFor example, the following function call in Exercise 1 below: \\n\\nsimple_ml.print_attribute_names_and_values(single_instance_list, attribute_names)\\n\\nuses dotted notation to reference the print_attribute_names_and_values() function in the simple_ml module.\\n\\nAfter you have defined your function in Exercise 1, you can test it by deleting the simple_ml module specification, so that the statement becomes\\n\\nprint_attribute_names_and_values(single_instance_list, attribute_names)\\n\\nThis will reference the print_attribute_names_and_values() function in the current namespace (main), i.e., the top-level interpreter environment. The simple_ml.print_attribute_names_and_values() function will still be accessible in the simple_ml namespace by using the \\\"simple_ml.\\\" prefix.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"textJWiYvC\",\"type\":\"text\",\"body\":\"<h3>Exercise 1: define print_attribute_names_and_values()<\/h3>\",\"evaluatorReader\":false},{\"id\":\"markdownwWGRI9\",\"type\":\"markdown\",\"body\":\"Complete the following function definition, print_attribute_names_and_values(instance, attribute_names), so that it generates exactly the same output as the code above.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeKJ7uRK\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"def print_attribute_names_and_values(instance, attribute_names):\\n '''Prints the attribute names and values for an instance'''\\n # your code goes here\\n return\\n\\nimport simple_ml # this module contains my solutions to exercises\\n# to test your function, delete the 'simple_ml.' 
module specification in the call to print_attribute_names_and_values() below\\nsimple_ml.print_attribute_names_and_values(single_instance_list, attribute_names)\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":8},{\"id\":\"textrZ2uUM\",\"type\":\"text\",\"body\":\"<h3>File I\/O<\/h3>\",\"evaluatorReader\":false},{\"id\":\"markdownoGan1g\",\"type\":\"markdown\",\"body\":\"Python [file input and output](http:\/\/docs.python.org\/2\/tutorial\/inputoutput.html#reading-and-writing-files) is done through [file](http:\/\/docs.python.org\/2\/library\/stdtypes.html#file-objects) objects. A file object is created with the [open(name[, mode])](http:\/\/docs.python.org\/2\/library\/functions.html#open) statement, where name is a string representing the name of the file, and mode is 'r' (read), 'w' (write) or 'a' (append); if no second argument is provided, the mode defaults to 'r'.\\n\\nA common Python programming pattern for processing an input text file is to \\n\\n* [**open**](http:\/\/docs.python.org\/2\/library\/functions.html#open) the file using a [**with**](http:\/\/docs.python.org\/2\/reference\/compound_stmts.html#the-with-statement) statement (which will automatically [**close**](http:\/\/docs.python.org\/2\/library\/stdtypes.html#file.close) the file after the statements inside the with execute)\\n* iterate over each line in the file using a **for** statement\\n\\nThe following code creates a list of instances, where each instance is a list of attribute values (like instance_1_str above). 
\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codegbkxOm\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"all_instances = [] # initialize instances to an empty list\\ndata_filename = 'agaricus-lepiota.data'\\n\\nwith open(data_filename, 'r') as f:\\n for line in f:\\n all_instances.append(line.strip().split(','))\\n \\nprint 'Read', len(all_instances), 'instances from', data_filename\\nprint 'First instance:', all_instances[0] # we don't want to print all the instances, so let's just print one to verify\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":9},{\"id\":\"textlfl23j\",\"type\":\"text\",\"body\":\"<h3>Exercise 2: define load_instances()<\/h3>\",\"evaluatorReader\":false},{\"id\":\"markdownXa9gzq\",\"type\":\"markdown\",\"body\":\"Define a function,load_instances(filename), that returns a list of instances in a text file. The function definition is started for you below. The function should exhibit the same behavior as the code above.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codejS0432\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"def load_instances(filename):\\n '''Returns a list of instances stored in a file.\\n \\n filename is expected to have a series of comma-separated attribute values per line, e.g.,\\n p,k,f,n,f,n,f,c,n,w,e,?,k,y,w,n,p,w,o,e,w,v,d'''\\n instances = []\\n # your code goes here\\n return instances\\n\\ndata_filename = 'agaricus-lepiota.data'\\n# to test your function, delete the 'simple_ml.' 
module specification in the call to load_instances() below\\nall_instances_2 = simple_ml.load_instances(data_filename)\\nprint 'Read', len(all_instances_2), 'instances from', data_filename\\nprint 'First instance:', all_instances_2[0] \"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":14},{\"id\":\"markdownWNgbJb\",\"type\":\"markdown\",\"body\":\"Output to a text file is usually done via the [file.write(str)](http:\/\/docs.python.org\/2\/library\/stdtypes.html#file.write) method.\\n\\nAs we saw earlier, the [str.join(words)](http:\/\/docs.python.org\/2\/library\/stdtypes.html#str.join) method returns a single str-delimited string containing each of the strings in the list words.\\n\\nSQL and Hive database tables often use the pipe ('|') delimiter to separate column values for each row when they are stored as flat files. The following code creates a new data file using pipes rather than commas to separate the attribute values.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeB0yDCN\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"print 'Converting to pipe delimiter, e.g.,', '|'.join(all_instances[0])\\n\\ndatafile2 = 'agaricus-lepiota-2.data'\\nwith open(datafile2, 'w') as f:\\n for instance in all_instances:\\n f.write('|'.join(instance) + '\\\\n') # '+' is the concatenation operator when used with strings\\n\\nall_instances_3 = []\\nwith open(datafile2, 'r') as f:\\n for line in f:\\n all_instances_3.append(line.strip().split('|')) # note: changed ',' to '|'\\nprint 'Read', len(all_instances_3), 'instances from', datafile2\\nprint 'First instance:', all_instances_3[0] # we don't want to print all the instances, so let's just print one to verify\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":13},{\"id\":\"textpp4OJU\",\"type\":\"text\",\"body\":\"<h3>List comprehensions<\/h3>\",\"evaluatorReader\":false},{\"id\":\"markdown4fKtAj\",\"type\":\"markdown\",\"body\":\"Python provides a powerful [*list 
comprehension*](http:\/\/docs.python.org\/2\/tutorial\/datastructures.html#list-comprehensions) construct to simplify the creation of a list by specifying a formula in a single expression.\\n\\nSome programmers find list comprehensions confusing, and avoid their use. We won't rely on list comprehensions here, but will show examples with and without list comprehensions below.\\n\\nOne common use of list comprehensions is in the context of the [str.join(words)](http:\/\/docs.python.org\/2\/library\/string.html#string.join) method we saw earlier.\\n\\nIf we wanted to construct a pipe-delimited string containing elements of the list, we could use aforloop to iteratively add list elements and pipe delimiters to a string. We would thereby add one pipe delimiter too many, and would thus have to shave that off at the end.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codelnV1yi\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"pipe_delimited_string = ''\\nfor x in [1, 2, 3]:\\n pipe_delimited_string += str(x) + '|'\\npipe_delimited_string = pipe_delimited_string[:-1]\\npipe_delimited_string\"},\"output\":{\"selectedType\":\"Text\",\"result\":\"'1|2|3'\"},\"evaluatorReader\":false,\"lineCount\":5},{\"id\":\"markdownWHdE3m\",\"type\":\"markdown\",\"body\":\"This process is much simpler using a list comprehension.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeyVLBZM\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"'|'.join([str(x) for x in [1, 2, 3]])\"},\"output\":{\"selectedType\":\"Text\",\"result\":\"'1|2|3'\"},\"evaluatorReader\":false,\"lineCount\":1},{\"id\":\"markdownHtymoi\",\"type\":\"markdown\",\"body\":\"As noted in the initial description of the UCI mushroom set above, 2480 of the 8124 instances have missing values (denoted by'?') for an attribute. 
There are several techniques for dealing with instances that include missing values, but to simplify things in the context of this primer - and following the example in the [Data Science for Business](http:\/\/www.data-science-for-biz.com\/) book - we will restrict our focus to only those *clean* instances that have no missing values.\\n\\nWe could use several lines of code - with an if statement inside a for loop - to create a clean_instances list from the all_instances list. Or we could use a list comprehension.\\n\\nWe will show both approaches to creating clean_instances below.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeEblabw\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"# version 1: using an if statement nested within a for statement\\nclean_instances = []\\nfor instance in all_instances:\\n if '?' not in instance:\\n clean_instances.append(instance)\\n \\nprint len(clean_instances), 'clean instances'\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":7},{\"id\":\"codee8yU8S\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"# version 2: using an equivalent list comprehension\\nclean_instances_2 = [instance for instance in all_instances if '?' 
not in instance]\\n\\nprint len(clean_instances_2), 'clean instances'\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":4},{\"id\":\"textIwD8rb\",\"type\":\"text\",\"body\":\"<h3>Dictionaries (dicts)<\/h3>\",\"evaluatorReader\":false},{\"id\":\"markdownSiuOQL\",\"type\":\"markdown\",\"body\":\"Although single character abbreviations of attribute values (e.g., 'x') allow for more compact data files, they are not as easy to understand by human readers as the longer attribute value descriptions (e.g., 'convex').\\n\\nA Python [dictionary (or dict)](http:\/\/docs.python.org\/2\/tutorial\/datastructures.html#dictionaries) is an unordered, comma-delimited collection of *key, value* pairs, serving a similar function as a hash table or hashmap in other programming languages.\\n\\nWe could create a dictionary for the cap-type attribute values shown above:\\n\\n> bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s\\n\\nSince we will want to look up the value using the abbreviation (which is the representation of the value stored in the file), we will use the abbreviations as *keys* and the descriptions as *values*.\\n\\nA Python dictionary can be created by specifying all key: value pairs (with colons separating each *key* and *value*), or by adding them iteratively. We will use the first method below, and use the second method further below. 
A *value* in a Python dictionary (dict) is accessed by specifying its *key* using the general formdict[key].\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeM5wzPn\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"attribute_values_cap_type = {'b': 'bell', 'c': 'conical', 'x': 'convex', 'f': 'flat', 'k': 'knobbed', 's': 'sunken'}\\n\\nattribute_value_abbrev = 'x'\\nprint attribute_value_abbrev, '=', attribute_values_cap_type[attribute_value_abbrev]\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":4},{\"id\":\"markdownY9nVez\",\"type\":\"markdown\",\"body\":\"A Python dictionary is an *iterable* container, so we can iterate over the keys in a dictionary using aforloop.\\n\\nNote that since a dictionary is an *unordered* collection, the sequence of abbreviations and associated values is not guaranteed to appear in any particular order. \",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codevHU6rK\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"for attribute_value_abbrev in attribute_values_cap_type:\\n print attribute_value_abbrev, '=', attribute_values_cap_type[attribute_value_abbrev]\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":2},{\"id\":\"markdownwaNTNd\",\"type\":\"markdown\",\"body\":\"Python supports *dictionary comprehensions*, which have a similar form as the *list comprehensions* described above.\\n\\nFor example, if we provisionally omit the 'convex' cap-type (whose abbreviation is the last letter rather than first letter in the attribute name), we could construct a dictionary of abbreviations and descriptions using the following expression.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeds2gIF\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"attribute_values_cap_type_2 ={x[0]: x for x in ['bell', 'conical', 'flat', 'knobbed', 'sunken']}\\nprint 
attribute_values_cap_type_2\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":2},{\"id\":\"markdownP6WJ6G\",\"type\":\"markdown\",\"body\":\"While it's useful to have a dictionary of values for thecap-typeattribute, it would be even more useful to have a dictionary of values for every attribute. Earlier, we created a list ofattribute_names; let's expand this to create a list ofattribute_valueswherein each list element is a dictionary.\\n\\nRather than explicitly type in each dictionary entry in the Python interpreter, we'll define a function to read a file containing the list of attribute names, values and value abbreviations in the format shown above:\\n\\n* class: edible=e, poisonous=p\\n* cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s\\n* cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s\\n* ...\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeQG4LGa\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"def load_attribute_values(filename):\\n '''Returns a list of attribute values in a file.\\n \\n The attribute values are represented as dictionaries, wherein the keys are abbreviations and the values are descriptions.\\n filename is expected to have one attribute name and set of values per line, with the following format:\\n name: value_description=value_abbreviation[,value_description=value_abbreviation]*\\n for example\\n cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s\\n The attribute value description dictionary created from this line would be the following:\\n {'c': 'conical', 'b': 'bell', 'f': 'flat', 'k': 'knobbed', 's': 'sunken', 'x': 'convex'}'''\\n attribute_values = []\\n with open(filename) as f:\\n for line in f:\\n attribute_name_and_value_string_list = line.strip().split(':')\\n attribute_name = attribute_name_and_value_string_list[0]\\n if len(attribute_name_and_value_string_list) < 2:\\n attribute_values.append({}) # no values for this attribute\\n else:\\n 
value_abbreviation_description_dict = {}\\n description_and_abbreviation_string_list = attribute_name_and_value_string_list[1].strip().split(',')\\n for description_and_abbreviation_string in description_and_abbreviation_string_list:\\n description_and_abbreviation = description_and_abbreviation_string.strip().split('=')\\n description = description_and_abbreviation[0]\\n if len(description_and_abbreviation) < 2: # assumption: no more than 1 value is missing an abbreviation\\n value_abbreviation_description_dict[None] = description\\n else:\\n abbreviation = description_and_abbreviation[1]\\n value_abbreviation_description_dict[abbreviation] = description\\n attribute_values.append(value_abbreviation_description_dict)\\n return attribute_values\\n\\nattribute_filename = 'agaricus-lepiota.attributes'\\nattribute_values = load_attribute_values(attribute_filename)\\nprint 'Read', len(attribute_values), 'attribute values from', attribute_filename\\nprint 'First attribute values list:', attribute_values[0]\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":35},{\"id\":\"textmYDS8f\",\"type\":\"text\",\"body\":\"<h3>Exercise 3: define load_attribute_values()<\/h3>\",\"evaluatorReader\":false},{\"id\":\"markdownLyXNqu\",\"type\":\"markdown\",\"body\":\"We earlier created theattribute_nameslist manually. Theload_attribute_values()function above creates theattribute_valueslist automatically from the contents of a file ... each line of which starts with the name of each attribute ... 
which we discard.\\n\\nComplete the following function definition so that the code implements the functionality described in the docstring.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"code2YaDCx\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"def load_attribute_names_and_values(filename):\\n '''Returns a list of attribute names and values in a file.\\n \\n This list contains dictionaries wherein the keys are names \\n and the values are value description dictionaries.\\n \\n Each value description sub-dictionary will use the attribute value abbreviations as its keys \\n and the attribute descriptions as the values.\\n \\n filename is expected to have one attribute name and set of values per line, with the following format:\\n name: value_description=value_abbreviation[,value_description=value_abbreviation]*\\n for example\\n cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s\\n The attribute name and values dictionary created from this line would be the following:\\n {'name': 'cap-shape', 'values': {'c': 'conical', 'b': 'bell', 'f': 'flat', 'k': 'knobbed', 's': 'sunken', 'x': 'convex'}}'''\\n attribute_names_and_values = [] # this will be a list of dicts\\n # your code goes here\\n return attribute_names_and_values\\n\\nattribute_filename = 'agaricus-lepiota.attributes'\\n# to test your function, delete the 'simple_ml.' 
module specification in the call to load_attribute_names_and_values() below\\nattribute_names_and_values = simple_ml.load_attribute_names_and_values(attribute_filename)\\nprint 'Read', len(attribute_names_and_values), 'attribute values from', attribute_filename\\nprint 'First attribute name:', attribute_names_and_values[0]['name'], '; values:', attribute_names_and_values[0]['values']\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":24},{\"id\":\"textOCZ3cc\",\"type\":\"text\",\"body\":\"<h3>Counting<\/h3>\",\"evaluatorReader\":false},{\"id\":\"markdown0p8T1f\",\"type\":\"markdown\",\"body\":\"Data scientists often need to count things. For example, we might want to count the numbers of edible and poisonous mushrooms in the *clean_instances* list we created earlier.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeOSxCgQ\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"edible_count = 0\\nfor instance in clean_instances:\\n if instance[0] == 'e':\\n edible_count += 1 # this is shorthand for edible_count = edible_count + 1\\n\\nprint 'There are', edible_count, 'edible mushrooms among the', len(clean_instances), 'clean instances'\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":6},{\"id\":\"markdownBbiQG7\",\"type\":\"markdown\",\"body\":\"More generally, we often want to count the number of occurrences (frequencies) of each possible value for an attribute. 
One way to do so is to create a dictionary where each dictionary key is an attribute value and each dictionary value is the count of instances with that attribute value.\\n\\nUsing an ordinary dictionary, we must be careful to create a new dictionary entry the first time we see a new attribute value (that is not already contained in the dictionary).\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"code9nGbAn\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"cap_state_value_counts = {}\\nfor instance in clean_instances:\\n cap_state_value = instance[1] # cap-state is the 2nd attribute\\n if cap_state_value not in cap_state_value_counts:\\n cap_state_value_counts[cap_state_value] = 0\\n cap_state_value_counts[cap_state_value] += 1\\n\\nprint 'Counts for each value of cap-state:'\\nfor value in cap_state_value_counts:\\n print value, ':', cap_state_value_counts[value]\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":10},{\"id\":\"markdownCPdfkw\",\"type\":\"markdown\",\"body\":\"The Python [collections](http:\/\/docs.python.org\/2\/library\/collections.html) module provides a number of high performance container datatypes. A frequently useful datatype is a [defaultdict](http:\/\/docs.python.org\/2\/library\/collections.html#defaultdict-objects), which automatically creates an appropriate default value for a new key. 
For example, adefaultdict(int)automatically initializes a new dictionary entry to 0 (zero); adefaultdict(list)automatically initializes a new dictionary entry to the empty list ([]).\\n\\nAfter first importingdefaultdictfromcollections, we can usedefaultdict(int)to simplify the code above:\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeqRluKA\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"from collections import defaultdict # don't need to use collections.defaultdict() below\\n\\ncap_state_value_counts = defaultdict(int)\\nfor instance in clean_instances:\\n cap_state_value = instance[1]\\n cap_state_value_counts[cap_state_value] += 1\\n\\nprint 'Counts for each value of cap-state:'\\nfor value in cap_state_value_counts:\\n print value, ':', cap_state_value_counts[value]\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":10},{\"id\":\"textSS56FG\",\"type\":\"text\",\"body\":\"<h3>Exercise 4: define attribute_value_counts()<\/h3>\",\"evaluatorReader\":false},{\"id\":\"markdowngsqazX\",\"type\":\"markdown\",\"body\":\"Define a function,attribute_value_counts(instances, attribute, attribute_names), that returns adefaultdictcontaining the counts of occurrences of each value ofattributein the list ofinstances.attribute_namesis the list we created above, where each element is the name of an attribute.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codebCDvlO\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"# your definition goes here\\n\\nattribute = 'cap-shape'\\n# remove 'simple_ml.' 
below to test your function definition\\nattribute_value_counts = simple_ml.attribute_value_counts(clean_instances, attribute, attribute_names)\\n\\nprint 'Counts for each value of', attribute, ':'\\nfor value in attribute_value_counts:\\n print value, ':', attribute_value_counts[value]\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":9},{\"id\":\"textA7fNTh\",\"type\":\"text\",\"body\":\"<h3>Sorting<\/h3>\",\"evaluatorReader\":false},{\"id\":\"markdownj8jYOu\",\"type\":\"markdown\",\"body\":\"Earlier, we saw that there is alist.sort()method that will sort a list in-place, i.e., by replacing the original value oflistwith a sorted version of the elements inlist. \\n\\nThe Python [sorted(iterable[, cmp[, key[, reverse]]])](http:\/\/docs.python.org\/2\/library\/functions.html#sorted) function can be used to return a *copy* of a list, dictionary or any other [*iterable*](http:\/\/docs.python.org\/2\/glossary.html#term-iterable) container it is passed, in ascending order.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codemnyMa3\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"original_list = [3, 1, 4, 2, 5]\\nsorted_list = sorted(original_list)\\nprint original_list\\nprint sorted_list\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":4},{\"id\":\"markdownPRhjzu\",\"type\":\"markdown\",\"body\":\"Since it returns a copy,sorted()can be used with strings.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"code3pCIte\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"print sorted('python')\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":1},{\"id\":\"markdownEgXSkJ\",\"type\":\"markdown\",\"body\":\"sorted()can also be used with dictionaries (it returns a sorted list of the dictionary *keys*).\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeALeaJy\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"print sorted(attribute_values_cap_type) # returns a list of 
sorted keys (but not values) in the dictionary\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":1},{\"id\":\"markdownzYx67s\",\"type\":\"markdown\",\"body\":\"However, we can use the sorted *keys* to access the *values* of a dictionary.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeRFQuVJ\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"for attribute_value_abbrev in sorted(attribute_values_cap_type):\\n print attribute_value_abbrev, '=', attribute_values_cap_type[attribute_value_abbrev]\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":2},{\"id\":\"markdownMdlxQz\",\"type\":\"markdown\",\"body\":\"An optional [keyword argument](http:\/\/docs.python.org\/2\/tutorial\/controlflow.html#keyword-arguments), reverse, can be used to reverse the order of the sorted list returned by the function. The default value of this optional parameter is False; to get non-default behavior, we must specify the name and value of the argument: reverse=True. \",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"code4zGzwq\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"print sorted([3, 1, 4, 2, 5], reverse=True)\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":1},{\"id\":\"codeQdo1N4\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"print sorted(attribute_values_cap_type, reverse=True) \"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":1},{\"id\":\"codeE2Blcx\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"attribute = 'cap-shape'\\nattribute_value_counts = simple_ml.attribute_value_counts(clean_instances, attribute, attribute_names)\\n\\nprint 'Counts for each value of', attribute, ':'\\nfor value in sorted(attribute_value_counts):\\n print value, ':', attribute_value_counts[value]\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":6},{\"id\":\"textEtGVxr\",\"type\":\"text\",\"body\":\"<h3>Sorting a dictionary by 
values<\/h3>\",\"evaluatorReader\":false},{\"id\":\"markdown2XOzjx\",\"type\":\"markdown\",\"body\":\"We often want to sort a dictionary by its *values* rather than its *keys*. \\n\\nFor example, when we printed out the counts of the attribute values for cap-shape above, the counts appeared in an ascending alphabetic order of their attribute names. It is often more helpful to show the attribute value counts in descending order of the counts (which are the values in that dictionary).\\n\\nThere are a [variety of ways to sort a dictionary by values](http:\/\/writeonly.wordpress.com\/2008\/08\/30\/sorting-dictionaries-by-value-in-python-improved\/), but the approach described in [PEP 265](http:\/\/legacy.python.org\/dev\/peps\/pep-0265\/) is generally considered the most efficient.\\n\\nIn order to understand the components used in this approach, we will revisit and elaborate on a few concepts involving *dictionaries*, *iterators* and *modules*.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"markdown70vUmj\",\"type\":\"markdown\",\"body\":\"The [dict.items()](http:\/\/docs.python.org\/2\/library\/stdtypes.html#dict.items) method returns an unordered list of (key, value) tuples in dict.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"code9D0q4X\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"attribute_values_cap_type.items()\"},\"output\":{\"selectedType\":\"Text\",\"result\":\"[('c', 'conical'),\\n\"},\"evaluatorReader\":false,\"lineCount\":1},{\"id\":\"markdownuNGLtS\",\"type\":\"markdown\",\"body\":\"A related method, [dict.iteritems()](http:\/\/docs.python.org\/2\/library\/stdtypes.html#dict.iteritems), returns an [iterator](http:\/\/docs.python.org\/2\/library\/stdtypes.html#iterator-types) - a callable object that returns the *next* item in a sequence each time it is referenced (e.g., during each iteration of a for loop), which can be more efficient than generating *all* the items in the sequence before any are used. 
This is similar to the distinction betweenxrange()andrange()described above.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeLi89vD\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"attribute_values_cap_type.iteritems()\"},\"output\":{\"selectedType\":\"Html\",\"result\":\"<dictionary-itemiterator at 0x108e1adb8>\"},\"evaluatorReader\":false,\"lineCount\":1},{\"id\":\"codezqsv4w\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"for key, value in attribute_values_cap_type.iteritems():\\n print key, value\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":2},{\"id\":\"markdownJMQIPD\",\"type\":\"markdown\",\"body\":\"The Python [operator](http:\/\/docs.python.org\/2\/library\/operator.html) module contains a number of functions that perform object comparisons, logical operations, mathematical operations, sequence operations, and abstract type tests.\\n\\nTo facilitate sorting a dictionary by values, we will use the [operator.itemgetter(item)](http:\/\/docs.python.org\/2\/library\/operator.html#operator.itemgetter) function that can be used to retrieve an indexed value (item) in a tuple (such as a(key, value)pair returned by[iter]items()).\\n\\nWe can useoperator.itemgetter(1)) to reference the value - the 2nd item in each(key, value)tuple, (at zero-based index position 1) - rather than the key - the first item in each(key, value)tuple (at index position 0).\\n\\nWe will use the optional keyword argumentkeyin [sorted(iterable[, cmp[, key[, reverse]]])](http:\/\/docs.python.org\/2\/library\/functions.html#sorted) to specify a *sorting* key that is not the same as thedictkey (thedictkey is the default *sorting* key)\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"code4HhO7A\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"import operator\\n\\nsorted(attribute_values_cap_type.iteritems(), key=operator.itemgetter(1))\"},\"output\":{\"selectedType\":\"Text\",\"result\":\"[('b', 
'bell'),\\n\"},\"evaluatorReader\":false,\"lineCount\":3},{\"id\":\"markdownl0IgKn\",\"type\":\"markdown\",\"body\":\"We can now sort the counts of attribute values in descending frequency of occurrence, and print them out using tuple unpacking.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeRXkCEe\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"attribute = 'cap-shape'\\nvalue_counts = simple_ml.attribute_value_counts(clean_instances, attribute, attribute_names)\\n\\nprint 'Counts for each value of', attribute, ':'\\nfor value, count in sorted(value_counts.iteritems(), key=operator.itemgetter(1), reverse=True):\\n print value, ':', count\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":6},{\"id\":\"textTKlOyX\",\"type\":\"text\",\"body\":\"<h3>Exercise 5: define print_all_attribute_value_counts()<\/h3>\",\"evaluatorReader\":false},{\"id\":\"markdownC739bT\",\"type\":\"markdown\",\"body\":\"Define a function,print_all_attribute_value_counts(instances, attribute_names), that prints each attribute name inattribute_names, and then for each attribute value, prints the value abbreviation, the count of occurrences of that value and the proportion of instances that have that attribute value.\\n\\nYou may find it helpful to use [fancier output formatting](http:\/\/docs.python.org\/2\/tutorial\/inputoutput.html#fancier-output-formatting). 
More details can be found in the Python documentation on [format string syntax](http:\/\/docs.python.org\/2\/library\/string.html#format-string-syntax).\\n\\nExamples of the str.format() function used in conjunction with print statements are shown below, followed by sample output of the simple_ml version of print_all_attribute_value_counts() (which uses similar formatting, but without hard-coded values).\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codePXbTrG\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"print 'Output of a sample line using str.format():'\\nprint 'class:', # comma at end keeps cursor on the same line for subsequent print statements\\nprint '{} = {} ({:5.3f}),'.format('e', 3488, 3488 \/ 5644.0),\\nprint '{} = {} ({:5.3f}),'.format('p', 2156, 2156 \/ 5644.0),\\nprint # a print statement with no arguments will advance the cursor to the beginning of the next line\\nprint 'End of sample line'\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":6},{\"id\":\"markdown7bvyjw\",\"type\":\"markdown\",\"body\":\"Define your version of print_all_attribute_value_counts(instances, attribute_names) below, deleting the simple_ml. module specification when you are ready to test your function.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codedKaE2T\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"# your function definition goes here\\n\\nprint '\\\\nCounts for all attributes and values:\\\\n'\\nsimple_ml.print_all_attribute_value_counts(clean_instances, attribute_names)\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":4},{\"id\":\"textd8PHdz\",\"type\":\"text\",\"body\":\"<h2>4. 
Using Python to Build and Use a Simple Decision Tree Classifier<\/h2>\",\"evaluatorReader\":false},{\"id\":\"textQTWSY6\",\"type\":\"text\",\"body\":\"<h3>Decision Trees<\/h3>\",\"evaluatorReader\":false},{\"id\":\"markdownkV85Fp\",\"type\":\"markdown\",\"body\":\"Wikipedia offers the following description of a [decision tree](https:\/\/en.wikipedia.org\/wiki\/Decision_tree) (with italics added to emphasize terms that will be elaborated below):\\n\\n> A decision tree is a flowchart-like structure in which each *internal node* represents a *test* of an *attribute*, each branch represents an *outcome* of that test and each *leaf node* represents *class label* (a decision taken after testing all attributes in the path from the root to the leaf). Each path from the root to a leaf can also be represented as a classification rule.\\n\\nThe image below depicts a decision tree created from the UCI mushroom dataset that appears on [Andy G's blog post about Decision Tree Learning](http:\/\/gieseanw.wordpress.com\/2012\/03\/03\/decision-tree-learning\/), where \\n\\n* a white box represents an *internal node* (and the label represents the *attribute* being tested)\\n* a blue box represents an attribute value (an *outcome* of the *test* of that attribute)\\n* a green box represents a *leaf node* with a *class label* of *edible*\\n* a red box represents a *leaf node* with a *class label* of *poisonous*\\n\\n<img src=\\\"http:\/\/gieseanw.files.wordpress.com\/2012\/03\/mushroomdecisiontree.png\\\" style=\\\"width: 800px;\\\" \/>\\n\\nIt is important to note that the UCI mushroom dataset consists entirely of [categorical variables](https:\/\/en.wikipedia.org\/wiki\/Categorical_variable), i.e., every variable (or *attribute*) has an enumerated set of possible values. Many datasets include numeric variables that can take on int or float values. Tests for such variables typically use comparison operators, e.g., $age < 65$ or $36,250 < adjusted\\\\_gross\\\\_income <= 87,850$. 
*[Aside: Python supports boolean expressions containing multiple comparison operators, such as the expression comparing adjusted_gross_income in the preceding example.]*\\n\\nOur simple decision tree will only accommodate categorical variables. We will closely follow a version of the [decision tree learning algorithm implementation](http:\/\/www.onlamp.com\/pub\/a\/python\/2006\/02\/09\/ai_decision_trees.html?page=3) offered by Chris Roach.\\n\\nOur goal in the following sections is to use Python to\\n\\n* *create* a simple decision tree based on a set of training instances\\n* *classify* (predict the class label for) an instance using a simple decision tree\\n* *evaluate* the performance of the simple decision tree on classifying a set of test instances\\n\\nFirst, we will explore some concepts and algorithms used in building and using decision trees.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"textrxGSsL\",\"type\":\"text\",\"body\":\"<h3>Entropy<\/h3>\",\"evaluatorReader\":false},{\"id\":\"markdownOA0EVe\",\"type\":\"markdown\",\"body\":\"When building a supervised classification model, the frequency distribution of attribute values is a potentially important factor in determining the relative importance of each attribute at various stages in the model building process.\\n\\nIn data modeling, we can use frequency distributions to compute ***entropy***, a measure of disorder (impurity) in a set.\\n\\nWe compute the entropy by multiplying the proportion of instances with each class label by the log of that proportion, and then taking the negative sum of those terms.\\n\\nMore precisely, for a 2-class (binary) classification task:\\n\\n$entropy(S) = - p_1 \\\\log_2 (p_1) - p_2 \\\\log_2 (p_2)$\\n\\nwhere $p_i$ is the proportion (relative frequency) of class *i* within the set *S*.\\n\\nFrom the output above, we know that the proportion of `clean_instances` that are labeled `'e'` (class *edible*) in the UCI dataset is $3488 \\\\div 5644 = 0.618$, and the proportion 
labeled `'p'` (class *poisonous*) is $2156 \\\\div 5644 = 0.382$.\\n\\nAfter importing the Python [math](http:\/\/docs.python.org\/2\/library\/math.html) module, we can use the [math.log(x[, base])](http:\/\/docs.python.org\/2\/library\/math.html#math.log) function in computing the entropy of the `clean_instances` of the UCI mushroom data set as follows:\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"code16ZeFN\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"import math\\nentropy = - (3488 \/ 5644.0) * math.log(3488 \/ 5644.0, 2) - (2156 \/ 5644.0) * math.log(2156 \/ 5644.0, 2)\\nprint entropy\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":3},{\"id\":\"texts63JCG\",\"type\":\"text\",\"body\":\"<h3>Exercise 6: define entropy()<\/h3>\",\"evaluatorReader\":false},{\"id\":\"markdowneb4yfa\",\"type\":\"markdown\",\"body\":\"Define a function, `entropy(instances)`, that computes the entropy of `instances`. You may assume the class label is in position 0; we will later see how to specify default parameter values in function definitions.\\n\\n[Note: the class label in many data files is the *last* rather than the *first* item on each line.]\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"code4UWJRQ\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"# your function definition here\\n\\n# delete 'simple_ml.' 
below to test your function\\nprint simple_ml.entropy(clean_instances)\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":4},{\"id\":\"textO4s9Bd\",\"type\":\"text\",\"body\":\"<h3>Information Gain<\/h3>\",\"evaluatorReader\":false},{\"id\":\"markdownvIP8OS\",\"type\":\"markdown\",\"body\":\"Informally, a decision tree is constructed using a recursive algorithm that \\n\\n* selects the *best* attribute \\n* splits the set into subsets based on the values of that attribute (each subset is composed of instances from the original set that have the same value for that attribute)\\n* repeats the process on each of these subsets until a stopping condition is met (e.g., a subset has no instances or has instances which all have the same class label)\\n\\nEntropy is a metric that can be used in selecting the best attribute for each split: the best attribute is the one resulting in the *largest decrease in entropy* for a set of instances. [Note: other metrics can be used for determining the best attribute]\\n\\n*Information gain* measures the decrease in entropy that results from splitting a set of instances based on an attribute.\\n\\n$IG(S, a) = entropy(S) - [p(s_1) × entropy(s_1) + p(s_2) × entropy(s_2) ... 
+ p(s_n) × entropy(s_n)]$\\n\\nwhere $n$ is the number of distinct values of attribute $a$, and $s_i$ is the subset of $S$ where all instances have the $i$th value of $a$.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeQQZheF\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"print 'Information gain for different attributes:\\\\n'\\nfor i in range(1, len(attribute_names)):\\n print '{:5.3f} {:2} {}'.format(simple_ml.information_gain(clean_instances, i), i, attribute_names[i])\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":3},{\"id\":\"markdownEhmYRA\",\"type\":\"markdown\",\"body\":\"We can sort the attributes in decreasing order of information gain.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeh83nRO\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"print 'Information gain for different attributes:\\\\n'\\nsorted_information_gain_indexes = sorted([(simple_ml.information_gain(clean_instances, i), i) for i in range(1, len(attribute_names))], \\n reverse=True)\\nprint sorted_information_gain_indexes, '\\\\n'\\n\\nfor gain, i in sorted_information_gain_indexes:\\n print '{:5.3f} {:2} {}'.format(gain, i, attribute_names[i])\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":7},{\"id\":\"markdownc8iQYz\",\"type\":\"markdown\",\"body\":\"The following variation does not use a list comprehension:\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codepO37O4\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"print 'Information gain for different attributes:\\\\n'\\n\\ninformation_gain_values = []\\nfor i in range(1, len(attribute_names)):\\n information_gain_values.append((simple_ml.information_gain(clean_instances, i), i))\\n \\nsorted_information_gain_indexes = sorted(information_gain_values, \\n reverse=True)\\nprint sorted_information_gain_indexes, '\\\\n'\\n\\nfor gain, i in sorted_information_gain_indexes:\\n print '{:5.3f} {:2} {}'.format(gain, i, 
attribute_names[i])\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":12},{\"id\":\"textZCxSt4\",\"type\":\"text\",\"body\":\"<h3>Exercise 7: define information_gain()<\/h3>\",\"evaluatorReader\":false},{\"id\":\"markdownnQODiQ\",\"type\":\"markdown\",\"body\":\"Define a function, `information_gain(instances, i)`, that returns the information gain achieved by selecting the `i`th attribute to split `instances`. It should exhibit the same behavior as the `simple_ml` version of the function.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codexevQ8n\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"# your definition of information_gain(instances, i) here\\n\\n# delete 'simple_ml.' below to test your function\\nsorted_information_gain_indexes = sorted([(simple_ml.information_gain(clean_instances, i), i) for i in range(1, len(attribute_names))], \\n reverse=True)\\n\\nprint 'Information gain for different attributes:\\\\n'\\nfor gain, i in sorted_information_gain_indexes:\\n print '{:5.3f} {:2} {}'.format(gain, i, attribute_names[i])\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":9},{\"id\":\"textQPYudY\",\"type\":\"text\",\"body\":\"<h3>Building a Simple Decision Tree<\/h3>\",\"evaluatorReader\":false},{\"id\":\"markdownMoSujA\",\"type\":\"markdown\",\"body\":\"We will implement a modified version of the [ID3](https:\/\/en.wikipedia.org\/wiki\/ID3_algorithm) algorithm for building a simple decision tree.\\n\\n ID3 (Examples, Target_Attribute, Attributes)\\n Create a root node for the tree\\n If all examples are positive, Return the single-node tree Root, with label = +.\\n If all examples are negative, Return the single-node tree Root, with label = -.\\n If number of predicting attributes is empty, then Return the single node tree Root,\\n with label = most common value of the target attribute in the examples.\\n Otherwise Begin\\n A ← The Attribute that best classifies examples.\\n Decision Tree attribute for Root = A.\\n For each 
possible value, v_i, of A,\\n Add a new tree branch below Root, corresponding to the test A = v_i.\\n Let Examples(v_i) be the subset of examples that have the value v_i for A\\n If Examples(v_i) is empty\\n Then below this new branch add a leaf node with label = most common target value in the examples\\n Else below this new branch add the subtree ID3 (Examples(v_i), Target_Attribute, Attributes \u2013 {A})\\n End\\n Return Root\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"markdownLMNsPw\",\"type\":\"markdown\",\"body\":\"In building a decision tree, we will need to split the instances based on the index of the *best* attribute, i.e., the attribute that offers the *highest information gain*. We will use separate utility functions to handle these subtasks. To simplify the functions, we will rely exclusively on attribute indexes rather than attribute names.\\n\\n***Note:*** the algorithm above is *recursive*, i.e., there is a recursive call to `ID3` within the definition of `ID3`. Covering recursion is beyond the scope of this primer, but there are a number of other resources on [using recursion in Python](https:\/\/www.google.com\/search?q=python+recursion). Familiarity with recursion will be important for understanding both the tree construction and classification functions below.\\n\\nFirst, we will define a function to split a set of instances based on any attribute. 
This function will return a dictionary in which each *key* is a distinct value of the specified `attribute_index`, and each *value* is a list representing the subset of `instances` that have that attribute value.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"code0UfLI8\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"from collections import defaultdict\\n\\ndef split_instances(instances, attribute_index):\\n '''Splits a list of instances according to their values of a specified attribute.\\n \\n Returns a dictionary in which each key is a distinct value of attribute_index,\\n and each value is a list representing the subset of instances that have that value for the attribute'''\\n partitions = defaultdict(list)\\n for instance in instances:\\n partitions[instance[attribute_index]].append(instance)\\n return partitions\\n\\npartitions = split_instances(clean_instances, 5)\\nprint [(partition, len(partitions[partition])) for partition in partitions]\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":14},{\"id\":\"markdownymZTON\",\"type\":\"markdown\",\"body\":\"Now that we can split instances based on a particular attribute, we would like to be able to choose the *best* attribute with which to split the instances, where *best* is defined as the attribute that provides the greatest information gain if instances were split based on that attribute. 
We will want to restrict the candidate attributes so that we don't bother trying to split on an attribute that was used higher up in the decision tree (or use the target attribute as a candidate).\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"textVsliWm\",\"type\":\"text\",\"body\":\"<h3>Exercise 8: define choose_best_attribute_index()<\/h3>\",\"evaluatorReader\":false},{\"id\":\"markdownHPHmXX\",\"type\":\"markdown\",\"body\":\"Define a function, `choose_best_attribute_index(instances, candidate_attribute_indexes)`, that returns the index in the list of `candidate_attribute_indexes` that provides the highest information gain if `instances` are split based on that attribute index.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeMj1tCC\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"# your function here\\n\\n# delete 'simple_ml.' below to test your function:\\nprint 'Best attribute index:', simple_ml.choose_best_attribute_index(clean_instances, range(1, len(attribute_names)))\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":4},{\"id\":\"markdownF6T2AB\",\"type\":\"markdown\",\"body\":\"A leaf node in a decision tree represents the most frequently occurring - or majority - class value for that path through the tree. 
We will need a function that determines the majority value for the class index among a set of instances.\\n\\nWe earlier saw how the [defaultdict](http:\/\/docs.python.org\/2\/library\/collections.html#collections.defaultdict) container in the [collections](http:\/\/docs.python.org\/2\/library\/collections.html) module can be used to simplify the construction of a dictionary containing the counts of all attribute values for all attributes, by automatically setting the count for any attribute value to zero when the attribute value is first added to the dictionary.\\n\\nThecollectionsmodule has another useful container, a [Counter](http:\/\/docs.python.org\/2\/library\/collections.html#collections.Counter) class, that can further simplify the construction of a specialized dictionary of counts. When aCounterobject is instantiated with a list of items, it returns a dictionary-like container in which the *keys* are the unique items in the list, and the *values* are the counts of each unique item in that list. 
\\n\\nThis container has an additional method, [most_common([n])](http:\/\/docs.python.org\/2\/library\/collections.html#collections.Counter.most_common), which returns a list of 2-element tuples representing the values and their associated counts for the most commonnvalues; ifnis omitted, the method returns all tuples.\\n\\nThe following is an example of how we can use aCounterto represent the frequency of different class labels, and how we can identify the most frequent value and its count.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeNk27h8\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"from collections import Counter\\n\\nclass_counts = Counter([instance[0] for instance in clean_instances])\\nprint 'class_counts: {}; most_common(1): {}, most_common(1)[0][0]: {}'.format(\\n class_counts, # the Counter object\\n class_counts.most_common(1), # returns a list in which the 1st element is a tuple with the most common value and its count\\n class_counts.most_common(1)[0][0]) # the most common value (1st element in that tuple)\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":7},{\"id\":\"markdowndl0n7V\",\"type\":\"markdown\",\"body\":\"The following variation does not use a list comprehension:\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeGJLHsc\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"class_values = []\\nfor instance in clean_instances:\\n class_values.append(instance[0])\\n \\nclass_counts = Counter(class_values)\\nprint 'class_counts: {}; most_common(1): {}, most_common(1)[0][0]: {}'.format(\\n class_counts, # the Counter object\\n class_counts.most_common(1), # returns a list in which the 1st element is a tuple with the most common value and its count\\n class_counts.most_common(1)[0][0]) # the most common value (1st element in that tuple)\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":9},{\"id\":\"markdownAZDgfG\",\"type\":\"markdown\",\"body\":\"Before putting 
all this together to define a decision tree construction function, it may be helpful to cover a few additional aspects of Python the function will utilize.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"markdownfueuRo\",\"type\":\"markdown\",\"body\":\"Python offers a very flexible mechanism for the [testing of truth values](http:\/\/python.net\/~goodger\/projects\/pycon\/2007\/idiomatic\/handout.html#testing-for-truth-values): in an **if** condition, any null object, zero-valued numerical expression or empty container (string, list, dictionary or tuple) is interpreted as *False* (i.e., *not True*):\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeNXfE9t\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"for x in [False, None, 0, 0.0, \\\"\\\", [], {}, ()]:\\n print '\\\"{}\\\" is'.format(x),\\n if x:\\n print True\\n else:\\n print False\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":6},{\"id\":\"markdownwfYeY7\",\"type\":\"markdown\",\"body\":\"Python also offers a [conditional expression (ternary operator)](http:\/\/docs.python.org\/2\/reference\/expressions.html#conditional-expressions) that allows the functionality of an if\/else statement that returns a value to be implemented as an expression. 
For example, the if\/else statement in the code above could be implemented as a conditional expression as follows:\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeC4xEre\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"for x in [False, None, 0, 0.0, \\\"\\\", [], {}, ()]:\\n print '\\\"{}\\\" is {}'.format(x, True if x else False) # using conditional expression as second argument to format()\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":2},{\"id\":\"markdownJ6QLhV\",\"type\":\"markdown\",\"body\":\"Python function definitions can specify [default parameter values](http:\/\/docs.python.org\/2\/tutorial\/controlflow.html#default-argument-values) indicating the value those parameters will have if no argument is explicitly provided when the function is called. Arguments can also be passed using [keyword parameters](http:\/\/docs.python.org\/2\/tutorial\/controlflow.html#keyword-arguments) indicating which parameter will be assigned a specific argument value (which may or may not correspond to the order in which the parameters are defined).\\n\\nThe [Python Tutorial page on default parameters](http:\/\/docs.python.org\/2\/tutorial\/controlflow.html#default-argument-values) includes the following warning:\\n\\n> Important warning: The default value is evaluated only once. This makes a difference when the default is a mutable object such as a list, dictionary, or instances of most classes. 
\\n\\nThus it is generally better to use the Python null object, `None`, rather than an empty `list` (`[]`), `dict` (`{}`) or other mutable data structure when specifying default parameter values for any of those data types.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeylKfEx\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"def parameter_test(parameter1=None, parameter2=None):\\n '''Prints the values of parameter1 and parameter2'''\\n print 'parameter1: {}; parameter2: {}'.format(parameter1, parameter2)\\n \\nparameter_test() # no args are required\\nparameter_test(1) # if any args are provided, 1st arg gets assigned to parameter1\\nparameter_test(1, 2) # 2nd arg gets assigned to parameter2\\nparameter_test(2) # remember: if only 1 arg, 1st arg gets assigned to parameter1\\nparameter_test(parameter2=2) # can use keyword to [only] provide an explicit value for parameter2\\nparameter_test(parameter2=2, parameter1=1) # can use keywords for either arg, in either order\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":10},{\"id\":\"texthCrTor\",\"type\":\"text\",\"body\":\"<h3>Exercise 9: define majority_value()<\/h3>\",\"evaluatorReader\":false},{\"id\":\"markdownQPo08y\",\"type\":\"markdown\",\"body\":\"Define a function, `majority_value(instances, class_index)`, that returns the most frequently occurring value of `class_index` in `instances`. The `class_index` parameter should be optional, and have a default value of `0` (zero).\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"coder4PrU0\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"# your definition of majority_value(instances) here\\n\\n# delete 'simple_ml.' 
below to test your function:\\nprint 'Majority value of index {}: {}'.format(0, simple_ml.majority_value(clean_instances)) # note: relying on default parameter here\\n# although there is only one class_index for the dataset, we'll test it by providing non-default values\\nprint 'Majority value of index {}: {}'.format(1, simple_ml.majority_value(clean_instances, 1)) # using an optional 2nd argument\\nprint 'Majority value of index {}: {}'.format(2, simple_ml.majority_value(clean_instances, class_index=2)) # using a keyword\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":7},{\"id\":\"markdownrqJ2wi\",\"type\":\"markdown\",\"body\":\"The recursivecreate_decision_tree()function below uses an optional parameter,class_index, which defaults to0. This is to accommodate other datasets in which the class label is the last element on each line (which would be most easily specified by using a-1value). Most data files in the [UCI Machine Learning Repository](https:\/\/archive.ics.uci.edu\/ml\/datasets.html) have the class labels as either the first element or the last element.\\n\\nTo show how the decision tree is being built, an optionaltraceparameter, when non-zero, will generate some trace information as the tree is constructed. 
The indentation level is incremented with each recursive call via the use of the conditional expression (ternary operator),trace + 1 if trace else 0.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codewljvKy\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"def create_decision_tree(instances, candidate_attribute_indexes=None, class_index=0, default_class=None, trace=0):\\n '''Returns a new decision tree trained on a list of instances.\\n \\n The tree is constructed by recursively selecting and splitting instances based on \\n the highest information_gain of the candidate_attribute_indexes.\\n \\n The class label is found in position class_index.\\n \\n The default_class is the majority value for the current node's parent in the tree.\\n A positive (int) trace value will generate trace information with increasing levels of indentation.\\n \\n Derived from the simplified ID3 algorithm presented in Building Decision Trees in Python by Christopher Roach,\\n http:\/\/www.onlamp.com\/pub\/a\/python\/2006\/02\/09\/ai_decision_trees.html?page=3'''\\n \\n # if no candidate_attribute_indexes are provided, assume that we will use all but the target_attribute_index\\n if candidate_attribute_indexes is None:\\n candidate_attribute_indexes = range(len(instances[0]))\\n candidate_attribute_indexes.remove(class_index)\\n \\n class_labels_and_counts = Counter([instance[class_index] for instance in instances])\\n\\n # If the dataset is empty or the candidate attributes list is empty, return the default value\\n if not instances or not candidate_attribute_indexes:\\n if trace:\\n print '{}Using default class {}'.format('< ' * trace, default_class)\\n return default_class\\n \\n # If all the instances have the same class label, return that class label\\n elif len(class_labels_and_counts) == 1:\\n class_label = class_labels_and_counts.most_common(1)[0][0]\\n if trace:\\n print '{}All {} instances have label {}'.format('< ' * trace, len(instances), 
class_label)\\n return class_label\\n else:\\n default_class = simple_ml.majority_value(instances, class_index)\\n\\n # Choose the next best attribute index to best classify the instances\\n best_index = simple_ml.choose_best_attribute_index(instances, candidate_attribute_indexes, class_index) \\n if trace:\\n print '{}Creating tree node for attribute index {}'.format('> ' * trace, best_index)\\n\\n # Create a new decision tree node with the best attribute index and an empty dictionary object (for now)\\n tree = {best_index:{}}\\n\\n # Create a new decision tree sub-node (branch) for each of the values in the best attribute field\\n partitions = simple_ml.split_instances(instances, best_index)\\n\\n # Remove that attribute from the set of candidates for further splits\\n remaining_candidate_attribute_indexes = [i for i in candidate_attribute_indexes if i != best_index]\\n for attribute_value in partitions:\\n if trace:\\n print '{}Creating subtree for value {} ({}, {}, {}, {})'.format(\\n '> ' * trace,\\n attribute_value, \\n len(partitions[attribute_value]), \\n len(remaining_candidate_attribute_indexes), \\n class_index, \\n default_class)\\n \\n # Create a subtree for each value of the the best attribute\\n subtree = create_decision_tree(\\n partitions[attribute_value],\\n remaining_candidate_attribute_indexes,\\n class_index,\\n default_class,\\n trace + 1 if trace else 0)\\n\\n # Add the new subtree to the empty dictionary object in the new tree\/node we just created\\n tree[best_index][attribute_value] = subtree\\n\\n return tree\\n\\n# split instances into separate training and testing sets\\ntraining_instances = clean_instances[:-20]\\ntesting_instances = clean_instances[-20:]\\ntree = create_decision_tree(training_instances, trace=1) # remove trace=1 to turn off tracing\\nprint tree\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":77},{\"id\":\"markdownOdxON8\",\"type\":\"markdown\",\"body\":\"The structure of the tree shown above is rather 
difficult to discern from the normal printed representation of a dictionary.\\n\\nThe Python [pprint](http:\/\/docs.python.org\/2\/library\/pprint.html) module has a number of useful methods for pretty-printing or formatting objects in a more human readable way.\\n\\nThe [pprint.pprint(object, stream=None, indent=1, width=80, depth=None)](http:\/\/docs.python.org\/2\/library\/pprint.html#pprint.pprint) method will print `object` to a `stream` (a default value of `None` will dictate the use of [sys.stdout](http:\/\/docs.python.org\/2\/library\/sys.html#sys.stdout), the same destination as `print` statement output), using `indent` spaces to differentiate nesting levels, up to a maximum of `width` columns and up to a maximum nesting level of `depth` (`None` indicating no maximum).\\n\\nWe will use a variation on the import statement that imports one or more functions into the current namespace:\\n\\n from pprint import pprint\\n \\nThis will enable us to use `pprint()` rather than having to use dotted notation, i.e., `pprint.pprint()`. 
\\n\\nNote that if we wanted to define our own `pprint()` function, it would be better to use only\\n\\n import pprint\\n \\nso that we can still access the `pprint()` function in the `pprint` module (since defining `pprint()` in the current namespace would otherwise override the imported definition of the function).\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeEHFOKH\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"from pprint import pprint\\npprint(tree)\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":2},{\"id\":\"textXNNV52\",\"type\":\"text\",\"body\":\"<h3>Classifying Instances with a Simple Decision Tree<\/h3>\",\"evaluatorReader\":false},{\"id\":\"markdown523LVK\",\"type\":\"markdown\",\"body\":\"Usually, when we construct a decision tree based on a set of *training* instances, we do so with the intent of using that tree to classify a set of one or more *testing* instances.\\n\\nWe will define a function, `classify(tree, instance, default_class=None)`, to use a decision `tree` to classify a single `instance`, where an optional `default_class` can be specified as the return value if the instance represents a set of attribute values that don't have a representation in the decision tree.\\n\\nWe will use a design pattern consisting of a series of `if` statements, each of which returns a value if its condition is true, rather than a nested series of `if`, `elif` and\/or `else` clauses, as it helps constrain the levels of indentation in the function.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeNafKDD\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"def classify(tree, instance, default_class=None):\\n '''Returns a classification label for instance, given a decision tree'''\\n if not tree:\\n return default_class\\n if not isinstance(tree, dict): \\n return tree\\n attribute_index = tree.keys()[0]\\n attribute_values = tree.values()[0]\\n instance_attribute_value = instance[attribute_index]\\n if instance_attribute_value 
not in attribute_values:\\n return default_class\\n return classify(attribute_values[instance_attribute_value], instance, default_class)\\n\\nfor instance in testing_instances:\\n predicted_label = classify(tree, instance)\\n actual_label = instance[0]\\n print 'predicted: {}; actual: {}'.format(predicted_label, actual_label)\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":17},{\"id\":\"textnopZGk\",\"type\":\"text\",\"body\":\"<h3>Evaluating the Accuracy of a Simple Decision Tree<\/h3>\",\"evaluatorReader\":false},{\"id\":\"markdownkUel3r\",\"type\":\"markdown\",\"body\":\"It is often helpful to evaluate the performance of a model using a dataset not used in the training of that model. In the simple example shown above, we used all but the last 20 instances to train a simple decision tree, then classified those last 20 instances using the tree.\\n\\nThe advantage of this training\/testing split is that visual inspection of the classifications (sometimes called *predictions*) is relatively straightforward, revealing that all 20 instances were correctly classified.\\n\\nThere are a variety of metrics that can be used to evaluate the performance of a model. [Scikit Learn's Model Evaluation](http:\/\/scikit-learn.org\/stable\/modules\/model_evaluation.html) library provides an overview and implementation of several possible metrics. 
For now, we'll simply measure the *accuracy* of a model, i.e., the percentage of testing instances that are correctly classified (*true positives* and *true negatives*).\\n\\nThe accuracy of the model above, given the set of 20 testing instances, is 100% (20\/20).\\n\\nThe function below calculates the classification accuracy of a `tree` over a set of `testing_instances` (with an optional `class_index` parameter indicating the position of the class label in each instance).\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codev9xvTm\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"def classification_accuracy(tree, testing_instances, class_index=0, default_class=None):\\n '''Returns the accuracy of classifying testing_instances with tree, where the class label is in position class_index'''\\n num_correct = 0\\n for i in xrange(len(testing_instances)):\\n prediction = classify(tree, testing_instances[i], default_class)\\n actual_value = testing_instances[i][class_index]\\n if prediction == actual_value:\\n num_correct += 1\\n return float(num_correct) \/ len(testing_instances)\\n\\nprint classification_accuracy(tree, testing_instances)\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":11},{\"id\":\"markdownxMhxcD\",\"type\":\"markdown\",\"body\":\"The [zip([iterable, ...])](http:\/\/docs.python.org\/2.7\/library\/functions.html#zip) function combines 2 or more sequences or iterables; the function returns a list of tuples, where the *i*th tuple contains the *i*th element from each of the argument sequences or iterables.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codew8S13R\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"zip([0, 1, 2], ['a', 'b', 'c'])\"},\"output\":{\"selectedType\":\"Text\",\"result\":\"[(0, 'a'), (1, 'b'), (2, 'c')]\"},\"evaluatorReader\":false,\"lineCount\":1},{\"id\":\"markdownaxzkzO\",\"type\":\"markdown\",\"body\":\"We can use [list 
comprehensions](http:\/\/docs.python.org\/2\/tutorial\/datastructures.html#list-comprehensions), theCounterclass and thezip()function to modifyclassification_accuracy()so that it returns a packed tuple with \\n\\n* the number of correctly classified instances\\n* the number of incorrectly classified instances\\n* the percentage of instances correctly classified\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codet18QU8\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"def classification_accuracy(tree, instances, class_index=0, default_class=None):\\n '''Returns the accuracy of classifying testing_instances with tree, where the class label is in position class_index'''\\n predicted_labels = [classify(tree, instance, default_class) for instance in instances]\\n actual_labels = [x[class_index] for x in instances]\\n counts = Counter([x == y for x, y in zip(predicted_labels, actual_labels)])\\n return counts[True], counts[False], float(counts[True]) \/ len(instances)\\n\\nprint classification_accuracy(tree, testing_instances)\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":8},{\"id\":\"markdownRfPFpU\",\"type\":\"markdown\",\"body\":\"We sometimes want to partition the instances into subsets of equal sizes to measure performance. One metric this partitioning allows us to compute is a [learning curve](https:\/\/en.wikipedia.org\/wiki\/Learning_curve), i.e., assess how well the model performs based on the size of its training set. 
Another use of these partitions (aka *folds*) would be to conduct an [*n-fold cross validation*](https:\/\/en.wikipedia.org\/wiki\/Cross-validation_(statistics)) evaluation.\\n\\nThe following function,partition_instances(instances, num_partitions), partitions a set ofinstancesintonum_partitionsrelatively equally sized subsets.\\n\\nWe'll use this as yet another opportunity to demonstrate the power of using list comprehensions, this time, to condense the use of nestedforloops.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codepfrjr7\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"def partition_instances(instances, num_partitions):\\n '''Returns a list of relatively equally sized disjoint sublists (partitions) of the list of instances'''\\n return [[instances[j] for j in xrange(i, len(instances), num_partitions)] for i in xrange(num_partitions)]\"},\"output\":{\"result\":\"\"},\"evaluatorReader\":false,\"lineCount\":3},{\"id\":\"markdownHgzHy9\",\"type\":\"markdown\",\"body\":\"Before testing this function on the 5644clean_instancesfrom the UCI mushroom dataset, let's create a small number of simplified instances to verify that the function has the desired behavior.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"code27loD3\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"instance_length = 3\\nnum_instances = 5\\n\\nsimplified_instances = [[j for j in xrange(i, instance_length + i)] for i in xrange(num_instances)]\\n\\nprint 'Instances:', simplified_instances\\npartitions = partition_instances(simplified_instances, 2)\\nprint 'Partitions:', partitions\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":8},{\"id\":\"markdownfPKZWH\",\"type\":\"markdown\",\"body\":\"The following variations do not use list comprehensions.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeWRqAVU\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"def partition_instances(instances, 
num_partitions):\\n '''Returns a list of relatively equally sized disjoint sublists (partitions) of the list of instances'''\\n partitions = []\\n for i in xrange(num_partitions):\\n partition = []\\n # iterate over instances starting at position i in increments of num_paritions\\n for j in xrange(i, len(instances), num_partitions): \\n partition.append(instances[j])\\n partitions.append(partition)\\n return partitions\\n\\nsimplified_instances = []\\nfor i in xrange(num_instances):\\n new_instance = []\\n for j in xrange(i, instance_length + i):\\n new_instance.append(j)\\n simplified_instances.append(new_instance)\\n\\nprint 'Instances:', simplified_instances\\npartitions = partition_instances(simplified_instances, 2)\\nprint 'Partitions:', partitions\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":21},{\"id\":\"markdown3wxAc9\",\"type\":\"markdown\",\"body\":\"The [enumerate(sequence, start=0)](http:\/\/docs.python.org\/2.7\/library\/functions.html#enumerate) function creates an iterator that successively returns the index and value of each element in asequence, beginning at thestartindex.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeLBRiLD\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"for i, x in enumerate(['a', 'b', 'c']):\\n print i, x\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":2},{\"id\":\"markdownQMnc9b\",\"type\":\"markdown\",\"body\":\"We can useenumerate()to facilitate slightly more rigorous testing of ourpartition_instancesfunction on oursimplified_instances.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"code5RBBaE\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"for i in xrange(5):\\n print '\\\\n# partitions:', i\\n for j, partition in enumerate(partition_instances(simplified_instances, i)):\\n print 'partition {}: {}'.format(j, 
partition)\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":4},{\"id\":\"markdownSkxS0A\",\"type\":\"markdown\",\"body\":\"Returning our attention to the UCI mushroom dataset, the following will partition ourclean_instancesinto 10 relatively equally sized disjoint subsets. We will use a list comprehension to print out the length of each partition\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"code2fvoHd\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"partitions = partition_instances(clean_instances, 10)\\nprint [len(partition) for partition in partitions]\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":2},{\"id\":\"markdown80fzj1\",\"type\":\"markdown\",\"body\":\"The following variation does not use a list comprehension.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codePLBF1A\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"for partition in partitions:\\n print len(partition), # note the comma at the end\\nprint\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":3},{\"id\":\"markdownFtqK7L\",\"type\":\"markdown\",\"body\":\"The following shows the different trees that are constructed based on partition 0 (first 10th) ofclean_instances, partitions 0 and 1 (first 2\/10ths) ofclean_instancesand allclean_instances.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"code0VVJ9w\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"tree0 = create_decision_tree(partitions[0])\\nprint 'Tree trained with {} instances:'.format(len(partitions[0]))\\npprint(tree0)\\n\\ntree1 = create_decision_tree(partitions[0] + partitions[1])\\nprint '\\\\nTree trained with {} instances:'.format(len(partitions[0] + partitions[1]))\\npprint(tree1)\\n\\ntree = create_decision_tree(clean_instances)\\nprint '\\\\nTree trained with {} 
instances:'.format(len(clean_instances))\\npprint(tree)\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":11},{\"id\":\"markdown1YjxUd\",\"type\":\"markdown\",\"body\":\"The only difference between the first two trees - *tree0* and *tree1* - is that in the first tree, instances with noodor(attribute index5is'n') and aspore-print-colorof white (attribute20='w') are classified asedible('e'). With additional training data in the 2nd partition, an additional distinction is made such that instances with noodor, a whitespore-print-colorand a clusteredpopulation(attribute21='c') are classified aspoisonous('p'), while all other instances with noodorand a whitespore-print-color(and any other value for thepopulationattribute) are classified asedible('e').\\n\\nNote that there is no difference betweentree1andtree(the tree trained with all instances). This early convergence on an optimal model is uncommon on most datasets (outside the UCI repository).\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"markdown2WPd6C\",\"type\":\"markdown\",\"body\":\"Now that we can partition our instances into subsets, we can use these subsets to construct different-sized training sets in the process of computing a learning curve.\\n\\nWe will start off with an initial training set consisting only of the first partition, and then progressively extend that training set by adding a new partition during each iteration of computing the learning curve.\\n\\nThe [list.extend(L)](http:\/\/docs.python.org\/2\/tutorial\/datastructures.html#more-on-lists) method enables us to extendlistby appending all the items in another list,L, to the end oflist.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeftNe8o\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"x = [1, 2, 3]\\nx.extend([4, 5])\\nprint x\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":3},{\"id\":\"markdownznHUCr\",\"type\":\"markdown\",\"body\":\"We can now define the 
function,compute_learning_curve(instances, num_partitions=10), that will take a list ofinstances, partition it intonum_partitionsrelatively equally sized disjoint partitions, and then iteratively evaluate the accuracy of models trained with an incrementally increasing combination of instances in the firstnum_partitions - 1partitions then tested with instances in the last partition. That is, a model trained with the first partition will be constructed (and tested), then a model trained with the first 2 partitions will be constructed (and tested), and so on. \\n\\nThe function will return a list ofnum_partitions - 1tuples representing the size of the training set and the accuracy of a tree trained with that set (and tested on thenum_partitions - 1set). This will provide some indication of the relative impact of the size of the training set on model performance.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeMsKBGM\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"def compute_learning_curve(instances, num_partitions=10):\\n '''Returns a list of training sizes and scores for incrementally increasing partitions.\\n\\n The list contains 2-element tuples, each representing a training size and score.\\n The i-th training size is the number of instances in partitions 0 through num_partitions - 2.\\n The i-th score is the accuracy of a tree trained with instances \\n from partitions 0 through num_partitions - 2\\n and tested on instances from num_partitions - 1 (the last partition).'''\\n \\n partitions = partition_instances(instances, num_partitions)\\n testing_instances = partitions[-1][:]\\n training_instances = []\\n accuracy_list = []\\n for i in xrange(0, num_partitions - 1):\\n # for each iteration, the training set is composed of partitions 0 through i - 1\\n training_instances.extend(partitions[i][:])\\n tree = create_decision_tree(training_instances)\\n partition_accuracy = classification_accuracy(tree, testing_instances)\\n 
accuracy_list.append((len(training_instances), partition_accuracy))\\n return accuracy_list\\n\\naccuracy_list = compute_learning_curve(clean_instances)\\nprint accuracy_list\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":23},{\"id\":\"markdownEleHei\",\"type\":\"markdown\",\"body\":\"Due to the quick convergence on an optimal decision tree for classifying the UCI mushroom dataset, we can use a larger number of smaller partitions to see a little more variation in acccuracy performance.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codembRimE\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"accuracy_list = compute_learning_curve(clean_instances, 100)\\nprint accuracy_list[:10]\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":2},{\"id\":\"textsfKkUR\",\"type\":\"text\",\"body\":\"<h3>Object-Oriented Programming: Defining a Python Class to Encapsulate a Simple Decision Tree<\/h3>\",\"evaluatorReader\":false},{\"id\":\"markdownSf4wbo\",\"type\":\"markdown\",\"body\":\"The simple decision tree defined above uses a Python dictionary for its representation. One can imagine using other data structures, and\/or extending the decision tree to support confidence estimates, numeric features and other capabilities that are often included in more fully functional implementations. To support future extensibility, and hide the details of the representation from the user, it would be helpful to have a user-defined class for simple decision trees.\\n\\nPython is an [object-oriented programming](https:\/\/en.wikipedia.org\/wiki\/Object-oriented_programming) language, offering simple syntax and semantics for defining classes and instantiating objects of those classes. 
*[It is assumed that the reader is already familiar with the concepts of object-oriented programming]*\\n\\nA Python [class](http:\/\/docs.python.org\/2\/tutorial\/classes.html) starts with the keyword **class** followed by a class name (identifier), a colon (':'), and then any number of statements, which typically take the form of assignment statements for class or instance variables and\/or function definitions for class methods. All statements are indented to reflect their inclusion in the class definition.\\n\\nThe members - methods, class variables and instance variables - of a class are accessed by prependingself.to each reference. Class methods always includeselfas the first parameter. \\n\\nAll class members in Python are *public* (accessible outside the class). There is no mechanism for *private* class members, but identifiers with leading double underscores (*\\\\_\\\\_member_identifier*) are 'mangled' (translated into *\\\\_class_name\\\\__member_identifier*), and thus not directly accessible outside their class, and can be used to approximate private members by Python programmers. \\n\\nThere is also no mechanism for *protected* identifiers - accessible only within a defining class and its subclasses - in the Python language, and so Python programmers have adopted the convention of using a single underscore (*\\\\_identifier*) at the start of any identifier that is intended to be protected (i.e., not to be accessed outside the class or its subclasses). \\n\\nSome Python programmers only use the single underscore prefixes and avoid double underscore prefixes due to unintended consequences that can arise when names are mangled. The following warning about single and double underscore prefixes is issued in [Code Like a Pythonista](http:\/\/python.net\/~goodger\/projects\/pycon\/2007\/idiomatic\/handout.html#naming):\\n\\n> try to avoid the __private form. I never use it. Trust me. 
If you use it, you WILL regret it later\\n\\nWe will follow this advice and avoid using the double underscore prefix in user-defined member variables and methods.\\n\\nPython has a number of pre-defined [special method names](http:\/\/docs.python.org\/2\/reference\/datamodel.html#special-method-names), all of which are denoted by leading and trailing double underscores. For example, the [object.init(self[, ...])](http:\/\/docs.python.org\/2\/reference\/datamodel.html#object.__init__) method is used to specify instructions that should be executed whenever a new object of a class is instantiated. \\n\\nThe code below defines a class,SimpleDecisionTree, with a single pseudo-protected member variable_treeand a pseudo-protected tree construction method_create(), two public methods -classify()andpprint()- and an initialization method that takes an optional list of traininginstancesand atarget_attribute_index. \\n\\nThe_create()method is identical to thecreate_decision_tree()function above, with the inclusion of theselfparameter (as it is now a class method). Theclassify()method is a similarly modified version of theclassify()andclassification_accuracy()functions above, with references totreeconverted toself._tree. Thepprint()method prints the tree in a human-readable format.\\n\\nNote that other machine learning libraries may use different terminology for the methods we've defined here. 
For example, in the [sklearn.tree.DecisionTreeClassifier](http:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.tree.DecisionTreeClassifier.html) class (and in mostsklearnclassifier classes), the method for constructing a classifier is named [fit()](http:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.fit) - since it \\\"fits\\\" the data to a model - and the method for classifying instances is named [predict()](http:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.predict) - since it is predicting the class label for an instance.\\n\\nMost comments and the use of the trace parameter have been removed to make the code more compact, but are included in the version found inSimpleDecisionTree.py.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codeDEBX7L\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"class SimpleDecisionTree:\n\n _tree = {} # this instance variable becomes accessible to class methods via self._tree\n\n def __init__(self, instances=None, target_attribute_index=0): # note the use of self as the first parameter\n if instances:\n self._tree = self._create(instances, range(1, len(instances[0])), target_attribute_index)\n \n def _create(self, instances, candidate_attribute_indexes, target_attribute_index=0, default_class=None):\n class_labels_and_counts = Counter([instance[target_attribute_index] for instance in instances])\n if not instances or not candidate_attribute_indexes:\n return default_class\n elif len(class_labels_and_counts) == 1:\n class_label = class_labels_and_counts.most_common(1)[0][0]\n return class_label\n else:\n default_class = simple_ml.majority_value(instances, target_attribute_index)\n best_index = simple_ml.choose_best_attribute_index(instances, candidate_attribute_indexes, target_attribute_index)\n tree = {best_index:{}}\n partitions = 
simple_ml.split_instances(instances, best_index)\n remaining_candidate_attribute_indexes = [i for i in candidate_attribute_indexes if i != best_index]\n for attribute_value in partitions:\n subtree = self._create(\n partitions[attribute_value],\n remaining_candidate_attribute_indexes,\n target_attribute_index,\n default_class)\n tree[best_index][attribute_value] = subtree\n return tree\n \n # calls the internal \\"protected\\" method to classify the instance given the _tree\n def classify(self, instance, default_class=None):\n return self._classify(self._tree, instance, default_class)\n \n # a method intended to be \\"protected\\" that can implement the recursive algorithm to classify an instance given a tree\n def _classify(self, tree, instance, default_class=None):\n if not tree:\n return default_class\n if not isinstance(tree, dict):\n return tree\n attribute_index = tree.keys()[0]\n attribute_values = tree.values()[0]\n instance_attribute_value = instance[attribute_index]\n if instance_attribute_value not in attribute_values:\n return default_class\n return self._classify(attribute_values[instance_attribute_value], instance, default_class)\n \n def classification_accuracy(self, instances, default_class=None):\n predicted_labels = [self.classify(instance, default_class) for instance in instances]\n actual_labels = [x[0] for x in instances]\n counts = Counter([x == y for x, y in zip(predicted_labels, actual_labels)])\n return counts[True], counts[False], float(counts[True]) \/ len(instances)\n \n def pprint(self):\n pprint(self._tree)\"},\"output\":{\"result\":\"\"},\"evaluatorReader\":false,\"lineCount\":55},{\"id\":\"markdownSCCm2V\",\"type\":\"markdown\",\"body\":\"The following statements instantiate aSimpleDecisionTree, using all but the last 20clean_instances, prints out the tree using itspprint()method, and then uses theclassify()method to print the classification of the last 
20clean_instances.\",\"mode\":\"preview\",\"evaluatorReader\":false},{\"id\":\"codejjgkHZ\",\"type\":\"code\",\"evaluator\":\"IPython\",\"input\":{\"body\":\"simple_decision_tree = SimpleDecisionTree(training_instances)\nsimple_decision_tree.pprint()\nprint\nfor instance in testing_instances:\n predicted_label = simple_decision_tree.classify(instance)\n actual_label = instance[0]\n print 'Model: {}; truth: {}'.format(predicted_label, actual_label)\nprint\nprint 'Classification accuracy:', simple_decision_tree.classification_accuracy(testing_instances)\"},\"output\":{},\"evaluatorReader\":false,\"lineCount\":9},{\"id\":\"textyyY6nm\",\"type\":\"text\",\"body\":\"<h2>5. Next steps<\/h2>\",\"evaluatorReader\":false},{\"id\":\"markdown84WWr3\",\"type\":\"markdown\",\"body\":\"There are a variety of Python libraries - e.g., [Scikit-Learn](http://scikit-learn.org/) and [xPatterns](http://atigeo.com/technology/) - for building more full-featured decision trees and other types of models based on a variety of machine learning algorithms. Hopefully, this primer will have prepared you for learning how to use those libraries effectively.\n\nMany Python-based machine learning libraries use other external Python libraries such as [NumPy](http://www.numpy.org/), [SciPy](http://www.scipy.org/scipylib/), [Matplotlib](http://matplotlib.org/) and [pandas](http://pandas.pydata.org/). 
There are tutorials available for each of these libraries, including the following:\n\n\* [Tentative NumPy Tutorial](http://wiki.scipy.org/Tentative_NumPy_Tutorial)\n\* [SciPy Tutorial](http://docs.scipy.org/doc/scipy/reference/tutorial/)\n\* [Matplotlib PyPlot Tutorial](http://matplotlib.org/1.3.1/users/pyplot_tutorial.html)\n\* [Pandas Tutorials](http://pandas.pydata.org/pandas-docs/stable/tutorials.html) (especially [10 Minutes to Pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html))\n\nThere are many machine learning or data science resources that may be useful to help you continue the journey. Here is a sampling:\n\n\* [An introduction to machine learning with scikit-learn](http://scikit-learn.org/stable/tutorial/basic/tutorial.html)\n\* Olivier Grisel's Strata 2014 tutorial, [Parallel Machine Learning with scikit-learn and IPython](https://github.com/ogrisel/parallel_ml_tutorial)\n\* Kaggle's [Getting Started With Python For Data Science](http://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience)\n\* Coursera's [Introduction to Data Science](https://www.coursera.org/course/datasci)\n\nPlease feel free to contact the author ([Joe McCarthy](mailto:joe@interrelativity.com?subject=Python for Data Science)) to suggest additional resources.\",\"mode\":\"preview\",\"evaluatorReader\":false}]}"}}}
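For readers who don't want to parse the escaped notebook JSON above: the round-robin partitioning routine it walks through (`partition_instances`) can be sketched in a few lines. This is a Python 3 adaptation (the notebook itself targets Python 2, using `xrange` and `print` statements), not the exact code from the notebook:

```python
def partition_instances(instances, num_partitions):
    """Round-robin split: element j goes to partition j % num_partitions,
    yielding relatively equally sized disjoint sublists."""
    return [[instances[j] for j in range(i, len(instances), num_partitions)]
            for i in range(num_partitions)]

# quick sanity check on a tiny list
print(partition_instances(list(range(7)), 3))  # [[0, 3, 6], [1, 4], [2, 5]]
```

The nested list comprehension condenses the two-level `for` loop version shown later in the notebook; both produce the same partitions.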
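The notebook's `zip()` + `Counter` accuracy computation can likewise be sketched as follows (again a Python 3 adaptation; the label values `'e'`/`'p'` mirror the edible/poisonous labels of the UCI mushroom dataset discussed above, and the standalone signature taking two label lists is a simplification of the notebook's tree-based version):

```python
from collections import Counter

def classification_accuracy(predicted_labels, actual_labels):
    """Return (num correct, num incorrect, fraction correct) by pairing
    predictions with ground truth and counting the matches."""
    counts = Counter(p == a for p, a in zip(predicted_labels, actual_labels))
    return counts[True], counts[False], counts[True] / len(actual_labels)

print(classification_accuracy(['e', 'p', 'e', 'e'], ['e', 'p', 'p', 'e']))
# (3, 1, 0.75)
```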

melmac48 commented 9 years ago

As I understand it, this issue occurs for this notebook only (and possibly others like it), and the user can still manually push the notebook to GitHub; our sharing server will render it when provided with the URL. That being the case, I think this is a v2 fix.

anglee commented 9 years ago

Fixed by https://github.com/twosigma/beaker-notebook/commit/8a956d6adcfae9e391efbd8a543a93aaa2bf97fc