ehuppert opened this issue 4 years ago
Thank you for the fascinating paper. The academic journal review process (especially in the social sciences) increasingly requires papers to show strong evidence via rigorous research designs. In the spirit of your paper, do you think that this emphasis on rigorous evidence is constraining the academic profession from pursuing truly novel yet difficult-to-test ideas?
Thanks for the inspiring paper! I have two questions in general:
Now that I've read the four features of the ideal empirical setting for studying this influence, I understand why you chose the newspaper industry as your sample. But I still don't see why the newspaper industry is better to experiment on than other possible industries, and I noticed that this concern comes up again in the discussion section. Could you elaborate on how the outcomes might change if we switched to another industry, particularly one less prone to using A/B tests?
You mention in Section 3.3 that radical transformations, such as launching a mobile version, are excluded from this article, and that practical constraints, financial limitations, and ethical concerns are not modeled either. How might we incorporate those into the model?
For anyone interested in the use of the Wayback Machine, I would like to recommend a five-minute read from Diana Kwon (Nature, Sep 10, 2020): "More than 100 scientific journals have disappeared from the internet." It seems the Wayback Machine is gaining more popularity than before.
Thank you! I was struck by your example that pulls from the history of science to demonstrate how experiments can be innovative and paradigm-shifting. However, there was a physician before Joseph Lister who used natural experiments to link sterilization to infection: Ignaz Semmelweis. Why did one set of experiments lead to widespread innovation (and countless lives saved), while the other was largely forgotten? I am interested to hear your ideas about which factors outside of the experiments themselves lead to actualized innovation.
Thanks for sharing this work. I am impressed by this novel way of measuring innovation by comparing code similarity, and by how you employ methods from different disciplines. I have some questions about the newspaper industry. You mention that "most newspapers do not have enough computer and data scientists in-house to build their own experimentation program." I wonder how you reached this conclusion? Regarding performance evaluation for newspaper websites: people tend to have habits around reading particular newspapers, and subscriptions are often monthly, quarterly, or yearly. It can also be the case that people visit the websites by clicking social media links, email subscriptions, or push notifications, or through search. Would that make users less sensitive to the structural and design features of the websites? Could you explain more about how you control for performance to isolate the effect of A/B testing?
Thanks for presenting your work to us! It was really interesting, and I can't wait to see the full version with figures included.
The quote about avoiding 'intellectual laziness' from your paper really stuck with me. What types of strategies can companies (and academics alike) use to effectively search the space of radical change while iterating using A/B testing? To me, this reads as an origins story for the consulting industry - sometimes a fresh pair of eyes might be exactly what is needed to break out of a small-scale, low risk/low reward strategy.
Also, what does your research have to say about the 'brand value' of a site, and the comfort users often develop with certain interfaces and styles? Facebook, Twitter etc. often launch refurbished interfaces to waves of criticism. In this case, it almost makes sense to be as inconspicuous as possible with changes - justifying the A/B testing approach. Is there a real need for radical change in these design choices?
Hi, very interesting topic. I'll just leave a very general question: what do you think about the difference between A/B testing and statistical hypothesis testing? Is the difference purely a matter of nomenclature? I've met people who dismiss data science as nothing more than a rebranding of statistics. What do you think? Do you think it will be counterproductive for the same concept to go by many different terms?
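For what it's worth, here is a minimal sketch of what I mean (my own illustration with made-up numbers, not anything from the paper): the analysis step of a typical A/B test is just a standard two-sample hypothesis test on conversion rates, so the difference seems to lie mostly in the surrounding deployment machinery.

```python
# Hedged sketch: analyzing a hypothetical A/B test as an ordinary two-proportion z-test.
from statsmodels.stats.proportion import proportions_ztest

conversions = [540, 610]     # made-up conversions in variants A and B
visitors = [10000, 10000]    # made-up visitors assigned to each variant

z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")  # small p rejects H0: equal conversion rates
```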
Thanks for coming to our workshop. I do not really have a lot of background on A/B testing or web design in general, but this was a very interesting topic!
Thinking in an exploration-exploitation framework, I wonder if this could be seen as a specific case of that behavior. A near-optimal solution to almost all exploration-exploitation problems is to explore early and exploit later. Taking this into account, a media owner might want to explore a lot of options through A/B testing at the early stage of implementation and go with the branch of an idea that did best in that exploration stage. If a webpage structure reaches whatever benefit the owner seeks to achieve, then according to the optimal solution there really is no need to explore more: you exploit it, which reduces radical change. I guess this is also similar to optimization algorithms that jump through the parameter space in large steps initially but take smaller steps at a later stage.
In this point of view, A/B testing, which lets people compare web design ideas easily, might be just a tool that lets people use the optimal and logical solution of the exploration-exploitation problem. Thus, maybe this is "rational" for the companies as a whole, not just for the managers. I guess this is still detrimental to innovations in general, but if this is a rational model for companies as a whole, I don't think this is something that can be addressed. Not sure if I am making a coherent point right now, but I would love to hear your opinions on this issue.
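To make the intuition concrete, here is a toy sketch (purely illustrative, not anything from the paper) of the explore-early, exploit-later behavior I have in mind: a chooser whose exploration rate decays over time, so early rounds try many design variants and later rounds settle on the best-performing one.

```python
import random

def choose_variant(estimated_values, t, eps0=0.5, decay=0.01):
    """Epsilon-greedy choice with a decaying exploration rate eps0 / (1 + decay * t)."""
    eps = eps0 / (1 + decay * t)
    if random.random() < eps:
        return random.randrange(len(estimated_values))  # explore: try any variant
    # exploit: stick with the variant that has looked best so far
    return max(range(len(estimated_values)), key=estimated_values.__getitem__)

# Early on (small t) this picks alternatives often; late (large t) it almost always
# exploits, which looks a lot like the incrementalism described in the paper.
print(choose_variant([0.051, 0.049, 0.054], t=5))
print(choose_variant([0.051, 0.049, 0.054], t=5000))
```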
Thanks for sending around your paper! My question is about the causality of the relationship you found. Logically there are three possible causal structures: adoption of A/B testing leads people not to want to try radical changes; people not wanting to try radical changes leads them to adopt A/B testing; or both are driven by an unseen shared cause. All three would explain the phenomenon we've observed.
One particularly interesting point is your Figure 9. It seems that there are substantially more changes with a high similarity measure in general. To me, this suggests that these websites are more likely to make small changes whether or not A/B testing is employed, which makes it more difficult to pin down the exact causal relationship. What are your thoughts on this?
Thanks for sharing your work, I found it really interesting! I'm wondering about the ethics of A/B testing and would love to get your thoughts. Presumably, web surfers won't care too much if you're testing whether a border should be three pixels or five, or this or that shade of blue, but larger tests may cause some concern, especially on newspaper websites. Everyone wants their news to feel like it's being impartially reported, not subliminally marketed to them in one way or another. It obviously gets worse if the newspaper starts trying to guess information about you (age, gender, race, etc.) and using that to tailor its presentation. And I know that in your paper you only measure formatting changes, not the content of the news presented, but I imagine this formatting/content divide is not totally clear-cut: changing the width of a column seems like pure formatting, but when that column always houses an article on a particular topic area (say, "Foreign Affairs"), that starts to feel like A/B testing for content.
With all that said, I'm wondering whether ethical concerns could be a mechanism for slowing innovation at newspapers. Essentially, someone might have an idea for a big change to the website, but management won't do it because they have fallen in love with A/B testing, and they won't run an A/B test on this change because it feels ethically wrong to present the population with two different versions of the news. Any thoughts on this? Thanks!
Thanks for providing a brilliant example of combining multiple databases into one solid piece of evidence. The proliferation of A/B testing is made possible by the Internet era, and it allows even researchers as humble as us students to benefit from the robustness of experimentation. Indeed, I find your argument for the potential drawbacks of the method convincing and your evidence-generating process thorough. However, I am concerned that the newspaper industry, which you selected as the test case (perhaps due to data availability), may not be externally valid for most innovation scenarios in either natural or social science. Essentially, while academics and technicians seek novel improvements of knowledge via innovation, media outlets these days mainly focus on maintaining their old audience and attracting new readers. In attracting new readers, newspapers would not make radical changes to draw the attention of groups very different from their previous readership. For instance, the Washington Times, an example used in your paper and a staunchly conservative outlet, would not adopt a radical change to attract liberal readers and forfeit its original endorsers. Therefore, I think the newspaper industry has deeply embraced incrementalism to maintain the paper-based readers and online subscribers who contribute to its revenue, and is averse to innovation in the first place. Consequently, attributing its lack of innovation to A/B testing rests on insufficient evidence. I understand it might be more challenging to gather samples, but I suggest using cases from scientific research, such as projects submitted to the NIH, whose aims are aligned with innovation.
Thank you, Dr. Berk Can Deniz, for your excellent paper! I was intrigued by how controversial A/B testing can actually be and by how useful it can be within companies.
I agree that cost is a big factor with digital randomized controlled trials. Because of this cost advantage, even small companies can use A/B testing without having to worry about hiring new employees to run tests. The paper also mentions attention to detail: with A/B testing, companies are able to change the smallest details of their websites and improve on them.
My question is, since cheap digital experiments are becoming more readily available, is there a limit to how often A/B tests should be run or who should have access to them? Is there a specific number of A/B tests that should be conducted to avoid incrementalism and exploitation? How do we know when we've crossed the line between exploration and exploitation?
Thank you very much!
I have some questions concerning your choice of the two control variables - "Tech_Stack" and "Log_Page_View".
I understand your intuition is based on the model designed by Koning, Hasan, and Chatterji (2019); I've checked that paper as well. The reason they use the "technology stack" control is that they study the performance of high-tech startups, which implement diverse tech stacks to build their products. Accordingly, in their TWFE regression results, adding the control makes the estimated coefficients drop dramatically, which convinces us that they chose an effective control variable.
By comparison, in both Tables 4 and 5 of your paper, after including the controls the changes in the estimated coefficients on AB_Testing are subtle, and so are the changes in R². Meanwhile, the coefficients on the control variables are not always significant, and even when they are, they are extremely small. Another point: from Koning et al. (2019), we might agree that "tech stack" and "log page view" are actually correlated. So wouldn't it cause some issues if both are included in the model (I don't fully understand Figure 8)? Your interpretations in the paper are reasonable, but I wonder whether there are better covariates to choose for your model. I'm very interested in your opinion.
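To be concrete about the robustness check I have in mind, here is a rough sketch (the file and column names are my own guesses, not the paper's actual code) comparing the AB_Testing coefficient with and without the two controls in a two-way fixed-effects specification:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical firm-by-period panel with the paper's variables (names assumed).
df = pd.read_csv("newspaper_panel.csv")

base = smf.ols("Similarity ~ AB_Testing + C(firm) + C(period)", data=df).fit()
ctrl = smf.ols(
    "Similarity ~ AB_Testing + Tech_Stack + Log_Page_View + C(firm) + C(period)",
    data=df,
).fit()

# If the controls matter, the coefficient on AB_Testing should move noticeably.
print(base.params["AB_Testing"], ctrl.params["AB_Testing"])
print(base.rsquared, ctrl.rsquared)
```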
Thank you for sharing your work with us! I wanted to ask:
In terms of when or when not to use A/B testing, I wonder if the concept of statistical power plays any significant role? In general experimental contexts, power analysis can provide an a priori indication of whether it is worth conducting an analysis given the input data. If the concern is that significant resources would be expended to conduct an A/B test whose result is merely incremental, wouldn't it make sense to apply some kind of criterion to determine when this framework is appropriate, rather than defaulting to it over high-risk, high-reward strategies?
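A sketch of the a priori check I mean, with assumed numbers rather than anything from the paper: if the lift a team hopes to detect is tiny, the required sample size can exceed the available traffic, which is itself a criterion for skipping the test.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Assumed baseline conversion of 5.0% and a hoped-for lift to 5.2%.
effect = proportion_effectsize(0.052, 0.050)
n_per_arm = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)
print(round(n_per_arm))  # users needed per arm; if traffic falls short, the test isn't worth running
```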
Thanks a lot for your presentation! I have a question similar to @SoyBison's about the causal relationship you establish here; I am wondering how you think about it. Also, what impact do you think A/B testing will have on the industry, for instance on productivity? Thanks a lot!
Thanks for the interesting paper! I'm wondering whether this general conclusion linking A/B testing and incrementalism can and should be generalized to academic research as well. A/B testing is quite similar to most economic or psychological experiments, where researchers tend to make the smallest change possible to create the ceteris-paribus condition and identify causality—in that sense, such 'incrementalism' seems to be valued. If not, what do you think leads to such divergent views of A/B testing in different settings? Thank you!
Thank you for your presentation! I was at an Uber tech talk at UChicago last year, and they talked about using A/B testing whenever they launch new products or features. What was very interesting to me was how they do it dynamically: they roll changes out gradually, carefully monitor the performance of the test over time, and optimize the process along the way. Do you have any idea whether this would lead to the same outcome you found with standard A/B testing? Would conducting A/B tests dynamically encourage small changes, or would it in fact encourage radical changes?
Thank you for sharing your work. You mentioned that you deliberately left out news outlets that are natively digital, which might manipulate the content rather than just the design of their websites. Do you expect different outcomes when the magnitude of change is measured through the less visual components, such as wording? How would you go about measuring changes in that case?
I am really happy and grateful that we can have you at our workshop. I have been learning A/B Testing since this September and your work greatly enriches my understanding.
From my perspective, one of the most challenging parts of this paper is distinguishing incremental changes from radical moves (you also mention this on p. 18). I really like the idea of tackling this through HTML and CSS so as to incorporate both structural and stylistic similarity. As I understand it, the "similarity" is essentially a measure of visible difference. So I am curious whether it also captures where the changes were implemented.
For instance, if a website has its logo right beside a help icon, then since the logo is far more important than a help box, a color change from light red to dark red on the logo is arguably a radical change, while a similar color change on the help button is more likely an incremental move. I wonder whether the algorithm can reliably detect this situation and classify the two cases correctly (see the sketch after this comment for what I mean).
I am not sure that I understand the detailed logic behind them, so it would be really appreciated if you could elaborate a little more during the workshop.
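To make the concern concrete, here is a generic sequence-similarity sketch (not necessarily the paper's measure, just my own illustration): a ratio computed over raw HTML gives nearly the same score to a color change on the logo and the same change on the help icon, because every tag gets equal weight regardless of its visual importance.

```python
from difflib import SequenceMatcher

before    = "<div id='logo' class='red-light'></div><div id='help' class='blue'></div>"
logo_edit = "<div id='logo' class='red-dark'></div><div id='help' class='blue'></div>"
help_edit = "<div id='logo' class='red-light'></div><div id='help' class='teal'></div>"

# Both edits are tiny textual changes, so an unweighted similarity score barely distinguishes them.
print(SequenceMatcher(None, before, logo_edit).ratio())
print(SequenceMatcher(None, before, help_edit).ratio())
```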
Thank you for coming and for giving the presentation. My question is similar to @SoyBison's: what is the temporal order of the causality between incremental action and the adoption of A/B experiments? I am also interested in what happens after companies learn that A/B testing might create incentives for incrementalism: would employers discourage or even abandon the method? And what kind of mechanism would increase employees' own initiative?
It's fantastic that we can discuss A/B testing in a research setting. I have run some A/B tests in industry, and sometimes we do not split all users into A and B groups; instead, we select certain (or random) users and do the A/B split inside that group, either to avoid influencing the bigger picture or because the strategy is expected to affect only certain people. In your data, 218 out of 297 companies performed A/B testing, but it could be that some companies did not run tests on their whole user base, so a given account might have been excluded from the test; I wonder how this should be handled. Secondly, I agree that search performance is an observable factor, but I wonder why you treat changes on webpages as an index of (positive) innovation. Your assumption seems to be that faster changes to a website mean greater innovation and drive higher search volume. But for companies without A/B testing, the effect of website changes is unknown: the changes could increase search volume, but they could also do the opposite. I'm confused about why the similarity rate is a valid measure of 'exploratory innovation' when it has two-sided effects for non-testing companies. Thanks!
Thanks for the paper and the presentation! I want to know whether the general conclusion linking A/B testing with incrementalism should also be generalized to academic research.
Thank you for sharing your work! I found the paper very interesting and the ideas presented and methods used very inventive. I noticed that in measuring innovation the focus was on radical change and similarity. In relation to this, I was wondering whether innovation can be measured as a change with a positive outcome, instead of just relating innovation to the magnitude of change observed? In addition, I was also wondering if it could be the case that incremental changes, as opposed to radical changes, have greater contributions to the companies’ performance. For instance, companies may consider that radical changes in their websites may have an adverse effect on user experience.
Thank you for sharing! My question is that different strategies have different targets, so if we only use the increase in search as the outcome measure, wouldn't it be partial? For example, some A/B tests are set up with the goal of increasing interaction rates, and the changes on the webpages are designed accordingly, but these do not necessarily improve search. So I'm quite curious why you chose this measure. Thanks!
It's really exciting to see A/B testing put forward for rigorous academic discussion. My initial concern, as already mentioned by my peers multiple times, is the underlying causality between incremental action and the adoption of A/B experiments. I would much appreciate it if you could spare some time to address this issue in depth.
Will the cumulative effect of progress from A/B testing be more popular with companies than one radical change?
The improvements made by these A/B tests reduce the likelihood of radical change. However, is the cumulative benefit to the business of frequent A/B testing over a three-year period greater than the benefit of radical change that occurs only once every three years?
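For concreteness, here is the kind of back-of-the-envelope comparison I have in mind (the numbers are entirely made up):

```python
# Many small A/B-tested lifts compounding weekly over three years vs. one big redesign.
weeks = 3 * 52
incremental_gain = 1.002 ** weeks   # assume +0.2% per successful weekly test, compounded
radical_gain = 1.30                 # assume one redesign worth a one-time +30% jump
print(f"incremental: x{incremental_gain:.2f}, radical: x{radical_gain:.2f}")
```

Under these made-up numbers the steady drip wins, but a slightly larger one-time jump or a lower per-test hit rate flips the answer, which is why I'm curious about the empirical comparison.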
Thanks!
Thanks so much for sharing the work!
I am very interested in hearing more about how A/B testing could potentially contribute to other realms beyond media and business.
In this paper, the author mentions that several different mechanisms could potentially explain this result, but that given the limited data, information, and conditions, we cannot tell which mechanism is truly at work. So I wonder what conditions or data would be needed to advance this research.
Fascinating work! Especially the way you operationalize similarities using web structure.
I wonder whether there is something fundamentally different between radical changes and incremental changes. Could we equate a radical change to a large number of incremental changes happening in a small time frame? If so, is it possible that it is not experimentation itself that slows down innovation, but rather the inefficiency of A/B testing?
Could other testing tools, such as multivariate testing or multi-armed bandits, tackle this issue?
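For reference, here is a minimal Thompson-sampling sketch of the bandit alternative I mean (illustrative only, not tied to any particular tool): instead of a fixed 50/50 split, traffic shifts toward better-performing variants as evidence accumulates, which can make trying many variants, including radical ones, cheaper.

```python
import random

def thompson_pick(successes, failures):
    """Pick the variant whose Beta(successes+1, failures+1) posterior draw is highest."""
    draws = [random.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)

# Hypothetical running tallies for three design variants (one of them a radical redesign).
wins, losses = [120, 95, 30], [4000, 3900, 900]
print(thompson_pick(wins, losses))  # most traffic flows to the apparent winner, but every variant keeps a chance
```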
Thank you for sharing your excellent research.
You identify possible negative impacts on firms that do not realize the potential perils of experimentation. What is your suggestion for helping them avoid those risks, whether through policy or through methodological choices? Thank you.
Thank you for sharing your work! It was really interesting to read about how A/B testing is becoming ubiquitous and about the impact this ease of testing has. It seems that since innovators tend to come up with low-risk ideas, the tests are run on what amount to small changes, and thus only encourage incremental improvements. I was wondering what you think the effect might be if a company were to include 'random' ideas (divergent from the trends you describe in innovator-created ideas) in its experimentation. Since A/B testing is so accessible, it would presumably be inexpensive to add such tests. Could the inclusion of random ideas encourage innovators to expand the ideas they put up for testing by adding diversity to the type and scale of ideas within the experiment?
Thank you for your work, and especially for diving into the newspaper industry! I have a question about the limitations you mention. Given that you focus on experimental changes to newspapers' designs, you note that media outlets would not tend to experiment with changes to content. As far as I know, the larger and more comprehensive a newspaper is, the greater the variety of its content, which may increase the chance of correlation between design changes and content changes; for instance, the purpose of a design-change experiment may be to make particular content more salient. Would it be possible to incorporate this consideration?
Thank you for your presentation! Do we need to control the scale of A/B testing, given that we are still unclear about how to avoid its potential negative impact?
Hi Berk! This is a really interesting paper! I am somewhat confused about A/B testing. How do you decide what sample size to test in this randomized experiment? And how do you show that the results are robust enough to generalize to the entire population?
Thank you for sharing the interesting paper! I'm curious whether A/B testing can be applied to fields beyond business.
Thank you for presenting this wonderful paper! I had little idea about A/B testing before, but I have a few concerns regarding the data selection process. As mentioned in the paper, you construct and use a historical archive of newspaper data. I am wondering what you think the limitations of using newspaper data in business research are, since newspapers seem a little out of date for capturing the latest business information. What prevented you from using traditional firm-level economic datasets or other web archives? Thanks again!
Thank you very much for this paper with its unexpected conclusions. I had not realized that experiments can hinder breakthrough achievements in a given area. My question is: could we regard the polls conducted by public opinion research centers (such as Pew and YouGov) as incremental experiments? Over time, these research centers have been improving their research methodologies and prediction models bit by bit. Do these minor changes, in a way or to some extent, impede fundamental improvements in their ability to accurately capture people's real opinions? Thanks again!
Thank you very much for sharing this interesting paper! I have learnt much about both experimentation and A/B testing. I have two specific questions regarding A/B testing within the context of this paper:
Thank you very much!
Thank you for sharing your work with us! I find it very innovative to apply A/B testing to academic research, and I'm looking forward to your talk tomorrow. I'm also curious about the data collection: a historical archive of newspaper data may carry potential selection bias. How do you see this affecting your work?
This is one of the papers I have enjoyed the most in this workshop! I am particularly inspired by the computational differentiation between stylistic and content similarities. My question is: what do you think is the main driver of a "radical change"? I am hypothesizing that external factors, such as a change of website managers or a shift in the newspaper's marketing strategy, might be more influential than an innovation in the website design itself.
Thanks for the wonderful presentation! This is really insightful. How do you think this approach will benefit other areas of sociology? Are there ways for other fields to learn from the results? Thanks!
Hi! This is a very interesting study that shows us how to use big-data resources to study change without having to conduct a longitudinal study. I have a question about the dependent variables. I understand that the similarity or difference between website pages is a robust indicator, but how can you make sure that this indicator has high enough validity to measure what you really care about (i.e., innovation)? In other words, can changes in the code accurately reflect the innovativeness of a website? I would think the level of innovation may also require subjective evaluation. Thanks!
Hi, thanks for sharing this fascinating study. My question is: if A/B testing encourages incrementalism and a lack of novel ideas, who or what is driving the increase in A/B testing in the newspaper and other industries? I think it would be very interesting to investigate what kinds of ideas are being A/B tested and which groups of employees (e.g., mid-level managers) adopted it most vigorously.
Thanks for sharing your research with us. You mentioned the agency problem and how it would be aggravated with the application of randomized controlled trials. For future research that aims at understanding how firms can mitigate the negative impact of experimentation, could you share some of your thoughts on what could be done to overcome the agency problem?
Very interesting research! Thank you so much for presenting. I'm just curious what some real-life implications are and how you plan to proceed.
Hello, thanks for sharing such interesting research. My question is, what other testing methods do you recommend that may mitigate the problem that A/B testing presents? Is there another method that may be as easily implemented, but encourage more innovation when developing design features?
Thanks for sharing! I'm wondering how the conclusion could be applied in industry.
Hello, thanks for sharing such an interesting paper! I am also curious about the application of A/B testing in wider fields. Thank you.
Thanks very much for your presentation! A/B testing is a very intriguing topic for me, and it strikes me as a business version of the controlled-experiment method. I am curious about your view on the origins of A/B testing and its flexibility in terms of forms and areas of application.
Another thought-provoking paper for this week's workshop. I have a lot of questions about this study; I'll try to list a few.
First of all, can we really reach the conclusion that experimentation curbs innovation by looking at design changes on newspaper websites? As you mention in the paper, content is not included in the study, but I think news content is the actual product offered by newspapers, and the website is just a publishing platform. It seems to me that in order to measure innovation in newspapers, we should look at things like the methodologies they use to collect and analyze data, or things even more difficult to measure, like the new perspectives they bring to the field of journalism. I am not convinced that big vs. small changes in website design can tell us much about the prospects for innovation (or maybe I am putting too much meaning into the word 'innovation').
Assuming that really is the case, I feel like such testable design decisions and significant innovations belong to different categories. I see no reason why a firm can't be involved in both simultaneously; they just do not feel mutually exclusive to me. In fact, such experimentation could even boost innovation by making various versions of an innovation A/B-testable, preventing genuinely useful innovations from failing because of simple design flaws. Aren't big changes to websites just a compounded version of smaller changes? Intuitively, it also feels like organizations adopting A/B testing, in addition to making smaller changes, are probably making more frequent and more profitable changes too, since they can easily test which option is better; so over a fixed amount of time, the total change to a website might not differ much between the two kinds of sites (e.g., 10 small changes vs. 1 big change in a year). Is there really more to it than the good old high-risk, high-reward relationship?
Finally, if that is the case and testing really does disincentivize innovation, do you think the proliferation of methods like A/B testing will favor smaller firms that are willing to take risks in pursuit of innovation? The principal-agent problem seems more apparent in established firms with vertical (hierarchical) organizational structures, and, as you said, A/B testing requires a large user base to yield statistically significant results. Does this mean that younger start-ups will disproportionately lead innovation, since they have much smaller user bases to test and relatively weaker incentives to implement incremental improvements?
Thank you for sharing your work; it is fascinating. I have a question that comes from thinking about the role of companies and managers. If we look at the setup of the market and the modern economic order, it really is not the job of a company to 'innovate'; its job is to "do just enough to generate revenue." In this sense, isn't incrementalism a necessary outcome? Simply put, providers are probably after marginal benefit for investors more than radical innovation for consumers.
Comment below with questions or thoughts about the reading for this week's workshop.
Please make your comments by Wednesday 11:59 PM, and upvote at least five of your peers' comments on Thursday prior to the workshop. You need to use 'thumbs-up' for your reactions to count towards 'top comments,' but you can use other emojis on top of the thumbs up.