info-design-lab / DE705-Interactive-Data-Visualization

Documentation of the IDC M.Des course Interactive Data Visualization, 3-20 Sep 2019

Data Visualization Community Survey 2019 #2

Closed venkatrajam closed 4 years ago

venkatrajam commented 5 years ago

The data and design brief for the next assignment are here.

rishivanukuru commented 5 years ago

DVS Challenge, but Shiny

Title: What We Viz With

Update 27/10/2019: I rewrote this comment for the final submission. Before this, it was just a collection of images and links, without much information anyway.

Beginning

From the dataset given to us by the Data Viz Society, the questions about tool usage were what caught my eye first. DVS members were asked about the specific tools they used, and what they used them for. In the question about specific tools, about 40 options were provided, along with a text field for other tools not mentioned in the 40.

Before cleaning the data, I thought of creating simple visualisations using each of the top 10/15 tools, as a sort of meta-visualisation. This would have been a nice way to learn about a bunch of tools. After a discussion in class, an additional idea could have been to talk about my experience learning each of the tools as part of the overall piece of work. The first step in any case would be to find out how many tools were mentioned, and which were the most used among them.

Across all members, there ended up being 140 other tools (181 in all) that were important enough to mention. Visualising the breadth of tools quickly became more interesting than learning a handful of them, and that is what I went with for the rest of the assignment.

Negotiating with Data

Simple data cleaning ended up not being so. To reach a table that showed how many participants used each tool, I had to manually go through all distinct tool name entries and combine those that were the same but written differently (MS Excel, Excel, xcel, Microsoft Excel, and so on). OpenRefine would have helped if the replacements had been confined to a single column, but that was not the case here.
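For illustration, here is a minimal R sketch of the kind of recoding this step amounts to; the lookup table and tool names below are hypothetical examples, and the actual merging was done by hand in a spreadsheet.

```r
# Sketch of the name-merging step (the real work was done manually).
# `raw_tools` and the lookup table are hypothetical examples.
raw_tools <- c("MS Excel", "Excel", "xcel", "Microsoft Excel", "ggplot2", "GGPlot")

canonical <- c(
  "ms excel"        = "Excel",
  "excel"           = "Excel",
  "xcel"            = "Excel",
  "microsoft excel" = "Excel",
  "ggplot2"         = "ggplot2",
  "ggplot"          = "ggplot2"
)

# Lower-case and trim before looking up, so spelling variants collapse to one
# canonical name; anything not in the table is kept as typed.
key     <- tolower(trimws(raw_tools))
cleaned <- ifelse(key %in% names(canonical), canonical[key], raw_tools)

table(cleaned)
```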

Then, an unnecessary amount of pain, Google Sheets formulae, and JS code snippets eventually led to a nice matrix describing which tools were used together with which other tools.
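The matrix itself is easy to sketch in R once the responses are in a respondents × tools indicator form; this toy example, with made-up data, shows the idea.

```r
# Sketch: from a respondents x tools 0/1 matrix, crossprod() gives a
# tools x tools matrix whose (i, j) entry is the number of respondents
# who use both tool i and tool j. `usage` is a hypothetical example.
usage <- matrix(
  c(1, 1, 0,
    1, 0, 1,
    1, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("Excel", "Tableau", "D3"))
)

co_use <- crossprod(usage)   # t(usage) %*% usage
diag(co_use) <- 0            # drop the "tool with itself" counts
co_use
```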

Both these tables were then unwittingly fed into the online Circos maker, leading to these two hairballs:

Tools to Participants, and Tools to Tools:

circos-table-intgeup-large

circos-table-oeiytkl-large

There obviously were far too many data points to use Circos as an effective means of visualising the relationship between tools. At this point, I thought of depicting these relationships as a network, where the size of a tool's node represented the number of people using that tool. The thickness of the link between tools could be indicative of the number of people using both tools.

I used crayons to draw what I hoped to execute:

IMG_20190920_142053636

Networking

I spent some time exploring a bunch of options for network creation. Gephi would have been nice; the issue was that it required the data to be in a particular markup format, and I didn't want to go through the process of translating the tables I had. There were a few JS-based options, most prominently Sigma.js, but there were issues with data porting and interactivity that I wasn't able to resolve quickly.

R Shiny was the tool I presented in class. You can read more about it in my post here. Given the size of the matrix I was dealing with, and the lack of any clear option otherwise, I thought it made sense to try and execute this project in R Shiny itself.

R has an immense range of packages for all kinds of specialised functions. I found one fairly accessible library for simple networks and fed my data into it. This is what it managed to create:

Rplot

This was nowhere close to what I had drawn on paper.

Luckily, I found that R has a really useful package for force-directed graphs that had already been optimised for Shiny deployment. The package is called networkD3, and it is a port of a D3 library for force-directed graphs.

Networks, but Shiny

A hint of what is to come

DVS_Diamond

To make a basic network work using this package, you need two tables: one listing the nodes and one listing the links between them.

The Node table was already in place from earlier work. The Link table was created by summing each row/column of the tool matrix mentioned a few paragraphs ago.

There are quite a few parameters that can be modified to customize the appearance of a graph through this package. The documentation available at CRAN, and on this page, proved to be fairly comprehensive and very useful. I had to play around with all possible variables - link size, node size, node repulsion, colours, text sizes, and opacity to name some - till a good enough network was reached.
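Roughly, the setup looks like the sketch below; the tool names, counts, groups, and parameter values are all placeholders rather than the ones used in the final piece.

```r
library(networkD3)

# Hedged sketch: build the two tables forceNetwork() needs from a tools x tools
# co-occurrence matrix, then draw the graph. All data here are illustrative.
co_use <- matrix(c(0, 35, 12,
                   35, 0,  8,
                   12, 8,  0),
                 nrow = 3,
                 dimnames = list(c("Excel", "Tableau", "D3"),
                                 c("Excel", "Tableau", "D3")))
users  <- c(Excel = 120, Tableau = 80, D3 = 45)   # people using each tool

# Node table: one row per tool (name, colour group, node size).
nodes <- data.frame(
  name  = rownames(co_use),
  group = c("Spreadsheets", "Business Intelligence", "Javascript-based"),
  size  = users[rownames(co_use)]
)

# Link table: one row per co-used pair; source/target are 0-indexed positions.
pairs <- which(upper.tri(co_use) & co_use > 0, arr.ind = TRUE)
links <- data.frame(source = pairs[, "row"] - 1,
                    target = pairs[, "col"] - 1,
                    value  = co_use[pairs])

forceNetwork(
  Links = links, Nodes = nodes,
  Source = "source", Target = "target", Value = "value",
  NodeID = "name", Group = "group", Nodesize = "size",
  linkWidth = JS("function(link) { return Math.sqrt(link.value); }"),
  charge   = -120,   # node repulsion
  fontSize = 14,
  opacity  = 0.8,
  zoom     = TRUE,
  legend   = TRUE
)
```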

After this came the fun part. I had to manually go through all tools, and try and categorise them into some kind of grouping. This was needed in order to be able to colour-code the nodes in some manner. I had never heard of most of the tools before, so this required me to read up about them and figure out which group they fit into best. I ended up with about 12 groups. Some categories were - Spreadsheets, Javascript-based, Python-based, Business Intelligence, Network Generators and In-House tools. I then used a categorical colour scale to differentiate between nodes belonging to the various groups.

At this point I realised that it was possible to add on-click events to each of the nodes. This meant that I could theoretically allow users to click on nodes and get more information about each tool. I decided to go ahead and implement this. The method of attaching a js snippet to the node is simple and well-documented. I searched for relevant links for all tools, made a new table containing that information, and used that to display information on the page.
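As a rough sketch (the input name and the server-side lookup below are hypothetical, not the code actually submitted):

```r
# The JS below runs in the browser when a node is clicked; in a Shiny app it
# can push the clicked tool's name back to the server.
click_js <- 'Shiny.onInputChange("clicked_node", d.name)'

# Passed to the call above as:  forceNetwork(..., clickAction = click_js)

# Server side (hypothetical): look the clicked tool up in a table of links and
# descriptions, and render it in the information panel.
# output$tool_info <- renderUI({
#   info <- tool_links[tool_links$name == input$clicked_node, ]
#   tags$div(tags$h3(info$name), tags$a(href = info$url, "Learn more"))
# })
```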

The layout of the Shiny page was taken care of by Bootstrap (which is integrated into Shiny). I changed the theme to a dark one, and played around with the positioning of the information panel. Everything was working fine from inside R, but upon deploying the final code online, the page wasn't responding to the window size. I spent an evening trying to fix that, but was ultimately unable to figure out what was going wrong. The final visualisation works best when you zoom out in browser, and doesn't work at all on mobile. I'll get back to this at some point in the future and see if I can fix it.
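For reference, here is a skeleton of the kind of Shiny layout described above; the use of the shinythemes package for the dark Bootstrap theme is an assumption, and the original app may have wired things differently.

```r
library(shiny)
library(shinythemes)
library(networkD3)

# Rough sketch of the page structure: a dark Bootstrap theme, the network on
# one side and an information panel on the other. `net` stands for the
# forceNetwork object built earlier.
ui <- fluidPage(
  theme = shinytheme("darkly"),
  titlePanel("What We Viz With"),
  fluidRow(
    column(8, forceNetworkOutput("network", height = "700px")),
    column(4, uiOutput("tool_info"))
  )
)

server <- function(input, output, session) {
  output$network <- renderForceNetwork(net)
}

shinyApp(ui, server)
```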

Sending in the entry

Finally, the title, "What We Viz With", was chosen primarily because it was alliterative. The link I submitted to the challenge can be found here (Zoom out to about 90% for best results).

Here's an image of how it looks:

annotwo

And all code involved in the final result can be found here.

In case you have any questions about anything mentioned here, feel free to ask.

That's all for now, Rishi


maulicule commented 5 years ago

Making sense of the data

Started out by trying to understand the survey itself. Simple encoding through symbols:

image

Cleaning the data

So far, the major component of the time invested in the exercise has been to clean the data and gain meaningful insights from it. It took several passes through the survey to identify potential (reliable) data points that could be utilized. Several fields allowed the interviewees to add their own options; some others were completely open-ended. My approach here was to use my initial understanding of the survey and classify the data into nominal, ordinal, interval, and ratio types, which could then be used to base decisions like the representation to be used, the visual encoding to be employed, etc.

Towards a PoC

I used Flourish to prototype and gauge diversity in terms of gender identity, years of dataviz experience, nationality, identification with the LGBTQ+ community, etc. I am currently exploring an animated bubble chart representation. Given time and the ability to code, I would like to use D3 and animate this (extremely optimistically!) similar to [Obama's Budget Sliced Four Ways by NYTimes](https://archive.nytimes.com/www.nytimes.com/interactive/2012/02/13/us/politics/2013-budget-proposal-graphic.html?hp)

Populating the graph with the initial dataset 01

Sized by years of Data Visualisation experience: 02

Shaded by Gender Identity 03

04

I am still working on coding it myself and streamlining the flow further, but it is a slow and time-consuming task, which I'm hoping to accomplish by and by.

The final version, as of now, is here: a Flourish Data Story with an animated bubble chart.

mayura7 commented 5 years ago

After looking at the survey, I thought of taking up the topic of thought leaders. I cleaned the data in Excel (manually first, and then with the help of OpenRefine). Here is the cleaned data set.
who

The data was mapped in terms of the names of thought leaders mentioned by the participants. I counted the number of times a name was mentioned in a given year, which became the popularity index. The counts were converted to percentage values in order to normalize for the varying number of participants every year. This value was multiplied by 1000 and rounded to 0 decimals so that the final number was an integer. The intent of the data viz is to highlight the increase in the number of thought leaders in the data viz space. The number of participants in 2018 was less than in the other two years, so I decided to omit that year to show the contrast. I had also categorized the thought leaders in terms of different identities, like individual, organization, event, etc. That became another page of the story, in which the viewer gets to see who sets the trends.
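A small R sketch of the popularity-index calculation described above, using made-up counts rather than the survey's actual numbers:

```r
# Sketch of the popularity index; the mention counts are placeholders.
mentions_2019    <- c("Leader A" = 90, "Leader B" = 60, "Leader C" = 45)
respondents_2019 <- 1360

pct   <- mentions_2019 / respondents_2019 * 100   # normalise for survey size
index <- round(pct * 1000)                        # scale and round to integers
index
```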

The data viz. has been made on Flourish. The link to the same can be found here.

I enjoyed developing and writing the cover story for the data visualization. I could have gone deeper in the data to make observations about individual leaders. Sharing links to these thought-leaders' work would have made the story solid and end-to-end.

akshayrpatil commented 5 years ago

Initial thoughts:

# Exploratory data visualization/

  1. First, I thought of visualizing the work experience of data viz professionals and their pay scale with respect to their location and educational background, so I started by looking at the survey data. It has 1320+ entries with 49+ questions.
  2. After some failed attempts to explore the worldwide data, and a discussion with a professor, I narrowed myself down to the Indian context. Screenshot 2019-11-04 at 3 05 36 AM

# Data cleaning, data sorting, and deciding variables./

  1. Cleaning took a long time, as the survey has 49+ questions. I sorted the data country-wise first and extracted the Indian responses. After that, I started deciding which questions (variables) from the survey to use for the visualization. My approach here was to use my initial understanding of the survey and to classify the data accordingly: country > city > professional role in the organization. (A sketch of this filtering step appears after the question list below.)

    Screenshot 2019-11-04 at 3 18 36 AM
  2. I focused on the following questions from the survey:

    1. Have you studied data visualization in school (or other formal environment), or did you learn how to do it on your own?
    2. Which one of these is the closest to describing your role?
    3. What is your yearly pay?
    4. How many years of experience do you have doing professional data visualization?
    5. What is your educational background?
    6. What city do you live in?
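A minimal R sketch of the country filter and city/role grouping described above; the file and column names are assumptions (the survey's actual headers are the full question texts).

```r
# Sketch with assumed file and column names.
survey <- read.csv("dvs_survey_2019.csv", check.names = FALSE)

india <- subset(survey, Country == "India")

# Respondents per city, then roles within each city.
table(india$City)
table(india$City, india$Role)
```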

# Final concept./

  1. Deciding the visual legend for the visualization. After brainstorming and initial sketching, I decided to go with a circle to represent each individual entry.

    Reason for choosing a circle for the exploration:

    A single circle can show 3-4 variables (question responses) by changing its colour and size, and by adding an additional symbol, outline, etc.

    Screenshot 2019-11-04 at 3 25 10 AM
  2. Deciding the placement of professional roles in the final diagram. Here, the placement of each professional role is decided by the number of professionals in that category, which helped me visualize it in a better way. Screenshot 2019-11-04 at 3 53 24 AM

For example, the Analyst category has the most people, so I kept it on the outermost circle of the diagram.

Screenshot 2019-11-04 at 4 01 03 AM

# Final visualization./

  1. The purpose of visualizing the Indian professionals' responses is to show trends and the pay scale of professionals with respect to their city and their experience in data viz.
  2. Here is a small attempt to show where a group of Indian data viz professionals live, their roles in their organizations, their work experience in data viz, and their yearly pay.
  3. The high-resolution static image of this visualization can be found here. 1Artboard 1 copy

rohanjhunja commented 5 years ago

Timeline of Data Viz Community

Experience vs Earning vs Freelancing Initial explorations of the data in Tableau.

Screenshot 2019-09-20 at 4 18 49 PM

Experience vs Motivation

Screenshot 2019-09-20 at 4 25 14 PM

People of Data Viz: This visualisation displays the composition of the Data Viz community and tells the story of Data Viz as it changes over a 10-year period. Responses were grouped by when a member started working in Data Viz.

Screenshot 2019-10-30 at 10 18 39 PM

Google Charts was used to pull data from the source spreadsheets using its query language. For this version, only data that could be visualised without any further modification was selected. This included survey questions with single-choice, known options.

=QUERY(A1:B, "select A, count(A) where A is not null and B is not null group by A pivot B ",1)

The number of respondents from each year varied, and people seemed to round off their number of years in the field to numbers like 2, 5, and 10. Groups were made so as to put together almost equal numbers of respondents, and 100% stacked column percentages were used to represent the data of each group.
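A rough R sketch of this grouping idea, with synthetic data standing in for the survey columns:

```r
# Sketch: bin respondents by years in the field into groups of roughly equal
# size, then express each group's answers as percentages of that group.
# `years` and `answer` are hypothetical stand-ins for survey columns.
set.seed(42)
years  <- sample(c(1, 2, 3, 5, 8, 10, 15), 500, replace = TRUE)
answer <- sample(c("Yes", "No", "Sometimes"), 500, replace = TRUE)

groups <- cut(years,
              breaks = unique(quantile(years, probs = seq(0, 1, 0.25))),
              include.lowest = TRUE)

# Column-wise percentages: each group's answers sum to 100%.
round(prop.table(table(answer, groups), margin = 2) * 100, 1)
```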

Screenshot 2019-10-29 at 6 31 15 PM

Initial issues included irregular ordering of data in the columns and a lack of annotations.

Screenshot 2019-10-29 at 6 31 46 PM

FIXED: ordering of data in the chart (largest group lowest) and text accompanying the viz. The issue of missing vertical axes is yet to be fixed.

Link to the website hosted here. Link to the open files hosted here.

prachitank commented 5 years ago

Skipped!

There were 1361 × 50 data points from the questionnaire. I decided to find trends in the questions that respondents skipped. This seemed like an interesting approach to take, and some initial iterations in Google Sheets itself showed that the number of skipped questions increased as the questionnaire progressed. However, there were three types of questions with different trends: 1) questions requiring checkbox or radio input, 2) questions requiring numeric input, and 3) questions requiring text input. The trends in each were different, so the decision was made to separate them out.
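A minimal R sketch of this counting step; the file name and the question-type classification below are assumptions, not the actual analysis (which was done in Google Sheets).

```r
# Sketch: treat blanks and NAs as skipped answers and count them per question.
survey <- read.csv("dvs_survey_2019.csv", check.names = FALSE,
                   na.strings = c("", "NA"))

skips <- colSums(is.na(survey))

# Skips in questionnaire order, to see the rise as the form progresses.
plot(seq_along(skips), skips, type = "h",
     xlab = "Question position", ylab = "Respondents who skipped")

# Hypothetical classification of each question into the three input types,
# so their skip trends can be compared separately.
question_type <- rep(c("choice", "numeric", "text"), length.out = ncol(survey))
tapply(skips, question_type, summary)
```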

These trends were interesting by themselves, but they did not complete the visualisation. So two more aspects were added to create a build-up to the final piece: a representation of all the data, followed by a look at the extremes of the data.

The final submission can be found here.

SONIBHAWNA commented 5 years ago

Iteration 1 Tree Map 1

Iteration 2 Radial

Iteration 3 FINAL

https://public.flourish.studio/visualisation/690263/

dhirajdethe commented 5 years ago

How is Data Visualization created, and how is it consumed?

The survey covers a whole range of information about the data visualization domain. Some of the responses give insights into the data visualization creation process, and some give insights into its consumption. In this data visualization, I have tried to tell a story about how data visualization is created and how it is consumed, in the form of a series of visualizations.

I sorted the data from Data Survey 2019 and split it into two categories: Creators and Users. From the data in both categories, I framed questions that an audience might want to ask, and then created a visualization as the answer to each question. 1 3

I sketched out the idea on paper first and then made all the visualizations using the Flourish tool. Using an HTML webpage, these questions and visualizations were put together as a series.

4

2

The link to the final data visualization is here (https://dhirajdethe.github.io/dvs-challenge/).

GauriTillu commented 5 years ago

The Story of Data Visualization Education

For the DVS Challenge 2019, I looked at the data on the methods by which data visualizers learned data visualization and their number of years of experience. I have also correlated gender with the method of learning.

Process used in creating the following visualizations

A few explorations for the visualizations:

Trend line in Google sheets Trends in Formal Education in Data Visualization (1)

Circular Visualization using Circos circos-table-vuylsze-large

Final visualization: Screenshot (643)

You can find the visualization here

shraddhadhodi commented 5 years ago

Initial Idea: After looking at the data I first thought of looking into three areas namely:

Although the idea was appealing to work on, I faced a lot of issues while cleaning the data, as two of the areas had multi-selection answers. After trying hard to refine the data, I finally decided to work on another idea.

Final Idea: To keep the task of refining the data simple, and also to come up with a good, presentable visualization, I decided to take up an area that is less complicated but meaningful. Therefore, I decided to work with the following questions from the survey:

With these areas, the idea is to bring out the relationship between how people acquired the skill of data visualization and the kind of role they perform. The relationship can also be seen when looking at how many of these people work as freelancers/consultants versus full-time with an organization. For example: there are 414 analysts, of which 81 are freelancers/consultants, 8 are occasional analysts, and the rest work as full-time analysts.
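The kind of cross-tabulation behind these counts can be sketched in R as follows; the file and column names are assumptions (the actual work was done in Excel).

```r
# Sketch of the cross-tabulation behind counts like "414 analysts, of which
# 81 are freelancers/consultants". Names are assumptions about the cleaned sheet.
survey <- read.csv("cleaned_survey.csv", check.names = FALSE)

role_vs_employment <- table(survey$Role, survey$Employment)
role_vs_employment["Analyst", ]     # full-time vs freelance/consultant split
addmargins(role_vs_employment)      # add row and column totals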

The data was cleaned using MS Excel, and the visualization was created using the Flourish tool, which is here.

avyayrkashyap commented 4 years ago

People in Data Viz

A career counselling kinda thing, but backed with data...

As I went through the data for the DVS challenge, I realised there are many fields and professions that make use of data visualisation as a tool. It also appeared that many people were doing the same thing. So I initially began with the idea of figuring out how unique a person can be. But this was throwing myself down a rabbit hole, given my limited knowledge of querying and manipulating data in real time. Thus, I decided to pursue a different question.

As a novice in this field, I was interested in knowing what future possibilities lay in my path as I studied data viz. This exploration came about as a result of that curiosity, to understand the metrics and judge whether I would want to pursue data visualisation in the future with greater vigour.

I also realised that this is something others might also be interested in knowing. Thus, I began looking at the questions that would help inform me and provide the data to back my decision.

I primarily considered questions that pertained to things I was doing, such as educational background, process of learning data viz, etc. and things that informed me of the future possibilities, as with the job prospects, pay scale, etc.

As I went through this, a few other factors such as gender and experience also began to seem both relevant and interesting. I chose to keep the professions as the base differentiator, as this would essentially help inform my decision. The data was then cleaned down to just over 1000 responses with all questions answered. Ambiguity in answers was removed through this, such as grouping gender from the original eight-plus options into four categories.

In order to show this, I decided to create a declarative data story through Flourish using bar charts and bubble charts. I have chosen a total of 10 questions to help answer my questions. Using Flourish allows me to tell my story, while also letting the user explore other visualisations. The order in which I bring about the questions is related to the answers I was most interested in knowing.

The story can be found here.

AshmiK commented 4 years ago

Initially, the idea was to look at the number of years of experience and the backgrounds these people come from, and to see if the two could be related. While going through the responses to various questions, I came across a lot of similar answers, some very unique responses, and typos. I specifically analysed the spelling mistakes in the names of the thought leaders mentioned by respondents. Palladio was used to clean up the data, find the most-mentioned thought leaders, and separate the spelling mistakes made by men and women (it turns out men made more mistakes than women).
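One way to flag misspellings automatically is approximate string matching; the sketch below is only an illustration with hypothetical responses, since the actual cleaning was done in Palladio.

```r
# Illustration only: flag a response as a misspelling if it is close to, but
# not exactly, a canonical name. `canonical` and `responses` are hypothetical.
canonical <- c("Alberto Cairo", "Edward Tufte", "Giorgia Lupi")
responses <- c("Alberto Cairo", "alberto ciaro", "Edward Tufty", "Giorgia Lupi")

# Edit distance from each response to its nearest canonical name.
dist_to_best <- apply(adist(tolower(responses), tolower(canonical)), 1, min)
misspelled   <- dist_to_best > 0 & dist_to_best <= 3   # small, non-zero distance

data.frame(responses, dist_to_best, misspelled)
```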

IMG_0480 The initial idea was to illustrate the most influential thought leaders along with a Sankey diagram of how their names were spelled or misspelled, in a fun sort of way, but it was not carried forward.

Misspelling a name can lead to a lot of confusion (a name being the identity of a person and their work). To better depict that, a poster was created where images of the thought leaders were treated according to the percentage of misspelled responses: the grain size of an image increased with the number of mistakes, resulting in a blurrier image. All responses were taken into account for this, irrespective of gender. IMG_0481

Another way to look at this would be as the thought leaders in the data viz world with the hardest names to spell. However, only the top nine thought leaders have been taken into account, since beyond these a lot of people were mentioned an equal number of times.

prachitank commented 4 years ago

***SKIPPED

What interested me in the data given by the DVS 2019 challenge were the questions that respondents had decided to skip.

The possibilities were to look at the kinds of questions respondents skipped most and the associated trends. Another approach that was iterated on was to analyse the answers to try and determine whether there is significant variation between them (which I could try to revisit with the new viewpoint and tools that the ongoing statistics course has equipped me with).

Along with the story that I wanted to tell, I had to decide on the tool I would use to tell it. I started experimenting with Tableau; however, my level of expertise prevented me from using it to explore the data. This led me to do the same in Google Sheets, using conditional formatting and formulae. The Sheets experiment resulted in interesting visuals, which I decided to capitalise on. The aim was to make a Google Sheet look like an Illustrator document.

I experimented to find the interesting trends, the types of questions, the outliers, and the sections of questions that skewed the data. With this, I cleaned up the data to get to trends over the duration of the form (over the entire course of filling it in), and separately analysed the questions with significantly high or low skip rates.

The final submission can be found here.

Thank You!

SaiAnjan commented 4 years ago

Hours spent in a week by the DVS survey respondents

The final assignment of the course started with discovering the amount of work happening in the Data Viz community. Looking at the survey and its responses was the first time I realised how much the Data Viz community is growing. The survey was conducted among the DV community to understand the differences in practice, tools, and audiences of anyone making data visualisations. Following is the link to the original survey: Original survey form

Survey for 2019 has responses to 50+ questions and was taken by over 1,350 people. It covers professional data visualization details like salary and hourly compensation, tool use, location, demographic data, audiences, organizational structure and more.

The assignment was to look at the response data and, guess what, visualise it! I downloaded the response data sheet from here. I created another sheet in the file listing all the questions from the 3 years, and interestingly there were additions over time. I started identifying the questions that were common among the 3 years and highlighting them:

Screenshot 2019-10-31 at 10 47 59 PM

I noticed the questions about the time spent by respondents on Data engineering, Data science, Design, and Data prep work. Since there were respondents from different professions, I wanted to visualise the hours spent in a week by each respondent on the above-mentioned categories.
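A rough R sketch of both steps, finding the common questions and reshaping the hours-per-week columns for plotting; the file and column names are assumptions about how the sheets might be laid out.

```r
# Sketch with assumed file and column names; the actual sheets differ.
survey_2017 <- read.csv("survey_2017.csv", check.names = FALSE)
survey_2018 <- read.csv("survey_2018.csv", check.names = FALSE)
survey_2019 <- read.csv("survey_2019.csv", check.names = FALSE)

# Questions asked in all three years (columns whose headers match exactly).
common_questions <- Reduce(intersect, list(colnames(survey_2017),
                                           colnames(survey_2018),
                                           colnames(survey_2019)))

# The four "hours per week" columns, reshaped to long form so that each
# respondent/category pair becomes one row for plotting.
hours_cols <- c("Hours per week: Data engineering",
                "Hours per week: Data science",
                "Hours per week: Design",
                "Hours per week: Data prep")

hours_long <- reshape(survey_2019[, hours_cols], direction = "long",
                      varying = hours_cols, v.names = "hours",
                      times = hours_cols, timevar = "category")
head(hours_long)
```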

A link to the final visualisation is here

kaishwary08 commented 4 years ago

DVS Survey Challenge

Starting out: dividing and chunking the 50 questions. IMG_20190915_203228 IMG_20190915_203235

Finding the right combination of data sets to build an interesting story was itself a struggle.

seemskk commented 4 years ago

Use of Data viz tools among respondents

From the DVS survey, I picked the part about respondents and their use of data visualisation tools. The first task was to clean up the survey results. This exercise was tedious, but good learning in itself. Though my intention was to plot each user against the tools they used, I later simplified my ambitions to much more achievable levels.

excel

Using RawGraphs' sunburst diagram, I plotted each software against its number of users. This particular diagram visually shows the proportion of users of each software.

attempt 1

Use of data viz tools in Journalism

From the above data, I looked specifically at the field of journalism and its use of tools. Out of the 1360 respondents, 71 mentioned journalism as their job sector. I used an alluvial diagram in RawGraphs to see the results.

journalism numbers

PS: Thanks to Malay for helping me with the Excel counts.

harsuyash commented 4 years ago

The dataset seemed large and mostly qualitative. I decided to colour-code the categorical attributes and then compare them against each other to find some correlations.

Screenshot (322)

Screenshot (335)

I gathered some selected attributes and sorted them to see colour patterns emerge (indicating patterns in the data).

Screenshot (336)

After making the cells extremely small and zooming out all the way, I could see the bigger picture; the data was already getting visualized.
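The same zoomed-out colour-grid idea can be sketched programmatically as a tile plot, one row per respondent and one column per attribute; the data below are random placeholders, since the actual exploration was done in Excel.

```r
library(ggplot2)

# Sketch: rows are respondents, columns are categorical attributes, and each
# cell is coloured by its category. The data are random placeholders.
set.seed(1)
df <- data.frame(
  Role       = sample(c("Analyst", "Designer", "Engineer"), 200, replace = TRUE),
  Income     = sample(c("Low", "Medium", "High"), 200, replace = TRUE),
  Employment = sample(c("Full-time", "Freelance"), 200, replace = TRUE),
  stringsAsFactors = FALSE
)

# Sorting rows by an attribute is what makes the colour bands visible.
df  <- df[order(df$Role, df$Income), ]
row <- seq_len(nrow(df))

long <- data.frame(
  row       = rep(row, times = ncol(df)),
  attribute = rep(names(df), each = nrow(df)),
  value     = unlist(df, use.names = FALSE)
)

ggplot(long, aes(x = attribute, y = row, fill = value)) +
  geom_tile() +
  theme_minimal()
```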

1. The varied roles of Data Visualizers and income

Role and income DVS-01

These were some quick data explorations that turned into visual confirmations.

We can observe concentrated white bars over leadership roles and engineers, indicating high income, and some darker bars over students, scientists, designers, and academicians.

2. Years of experience and income

Exp-01

There are lighter bars at the end (20-30 years of experience), but a lot of light red areas are seen even with <5 years of experience.

Note: The scale is skewed because the length of each interval is determined by the number of responses in that interval (as this is taken directly from Excel).

3. Which roles hire people as full-time data visualizers, and for which roles is it just a part of the job?

Part DVS-01

The blue bars indicate people who are hired to do Data Visualization specifically. I did not expect these results, but the blue bars mostly hover over designers, with many over developers as well.

After these explorations, which did reveal some visual patterns, I decided to narrow down on these roles and create a visualization around:

Who makes data visualization for whom?

How people with different roles interact as both creators and consumers

What I could visualize in my head was similar to an alluvial chart.

This required a lot of cleaning.

Screenshot (337)

I filtered the major responses and grouped the rare ones (librarian, lawyer, government officials, investments, etc.) into "Others".

Screenshot (338)

And made the data suitable for an Alluvial chart template.

The data revealed that most data viz is done for executives and project managers, and that the major contribution towards general-public consumption comes from analysts and designers.

Screenshot (344) Screenshot (343)

The final VIZ can be seen here

harshitsatija commented 4 years ago

A guide to 'Data Visualization as a job'

When I first saw the data (Original survey form), I was overwhelmed by the sheer number of answers. After seeing the types of questions, I started forming some affinities and tried to find something interesting in Tableau. The initial ideas were to look at correlations between questions like:

"Are you able to choose your own tools or are the choices made for you?" and "Hours a week focused on creating/implementing/productizing data visualizations?"

1

or "What focus is data visualization in your work?" and "What country do you live in?".

2

After going through the data some more, I saw an opportunity to explore, as well as answer, my question of what Data Visualization can be like as a job. After realizing what question to answer, I made some visualizations in Tableau.

3

4

Once I had set the central theme, the questions I had to answer through the visualizations became clear. I cleaned the dataset for the set of visualizations, which is linked here.

The final story was made using Flourish and is posted here.

aishaanam commented 4 years ago

I initially tried to segregate each tool to find which tool is used most. But data cleaning was too tedious, so I created a visualization showing how many men and women from different countries are into data visualization. Here is the link.

Gender_Data Visualizers   Countries