General Questions I've Had

harriet-mason commented 2 years ago

1) The absolute distance between two groups seems to work better than the relative distance for most of the scagnostics.
2) Trying to keep a global range in mind in the scagnostics didn't exactly work, whoops. Does the tour not keep the values on a unit circle if the data set is standardized?
3) The variables in the tour are all labelled PC. Are those principal components? As far as I can tell they should be the normal variables, so I'm a bit confused.
4) Some of the scagnostics can see a difference when calculated in the two groups, but it doesn't work in the projection pursuit. I'm not sure if I know enough about tourr to work out if the problem is noise or myopia in the index.
5) I'm not sure what optimization function to use in tourr, or how much that will matter in this case.
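
A minimal sketch of the absolute versus relative comparison in the first question, in base R. Both helper names and the definition of "relative" (difference divided by the larger value) are assumptions for illustration, not the project's actual index code. One possible reason the absolute version behaves better, which has not been checked here, is that the relative version inflates tiny gaps between values that are both close to zero.

```r
# Hypothetical helpers: neither is the project's actual index code.
abs_diff <- function(a, b) abs(a - b)
rel_diff <- function(a, b) abs(a - b) / max(a, b)  # assumed definition of "relative"

# Two pairs of group values with the same absolute gap
low_pair  <- c(a = 0.03, b = 0.01)  # both scagnostic values close to zero
high_pair <- c(a = 0.52, b = 0.50)  # both values mid-range

abs_diff(low_pair[["a"]], low_pair[["b"]])    # same gap as the high pair
abs_diff(high_pair[["a"]], high_pair[["b"]])
rel_diff(low_pair[["a"]], low_pair[["b"]])    # much larger than for the high pair
rel_diff(high_pair[["a"]], high_pair[["b"]])
```

With the relative version, two near-zero noise values can score higher than a real difference between mid-range values, which would pull the optimiser towards uninteresting projections.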

harriet-mason commented 2 years ago

Gist of Ursula's responses from the meeting

1) Try setting a threshold.
2) Set the global range with the maxdist function from tourr (but just use the code).
3) Read the documentation, and don't use the "sphere" option in the tour.
4) Optimisation: read Sherry's paper. If splines finds the numbat some of the time, it should find it all of the time, so the failures are a problem in optimisation.

harriet-mason commented 2 years ago

Ok so I have a couple of new notes for tomorrow's meeting with Ursula.

1) Thresholds

I tried setting thresholds for each scagnostic individually. I set the "nothing" baseline as the maximum value from the normal noise plots, because those are as "not interesting" as you can get. In hindsight that probably wasn't the best move, because all groups come out as either 0 or 1 (so it has a similar problem to before).

2) What to do if the numbat is low on a scagnostic.

Some measures (such as outlying and skewed) can "see" the numbat, but the numbat returns a smaller value than the noise instead of a higher value. I can tell that I should invert the convex scagnostic, but for these other ones it is less intuitive and I'm not sure if I should worry about it.

3) Optimisation

I read Sherry's paper and tried to visualise the current optimisers with ferrn to see if I could tell what was happening. I then ran into some issues.
1) I can't use explore_space_pcr() to compare the optimisation methods. As far as I can tell I have set it up correctly, but I keep getting the “Ferrn will perform PCA separately on each dimension” message. I tried to read through the source code to find out why this was happening, but I couldn't figure it out and gave up.
2) Several of the scagnostics can "see" the numbat, and the value for the (x4,x7) scatter plot is significantly different to the value for the non-numbat scatter plots, but the projection pursuit does not find it. After trying several optimisation techniques, it seems that the index value the projection pursuit finishes on is higher than the value for the (x4,x7) scatter plot, even though the final projection just looks like noise. Idk what to do about that.
3) Splines could see the difference. I ran the tour using seed(8) (it took 8 tries to get a seed that found the numbat), but when I ran it again it didn't find it, so idk if I'm going to be able to show it in tomorrow's meeting.

harriet-mason commented 2 years ago

Notes From Meeting

1) Abandon thresholds.
2) Doesn't matter.
3) Try different variations with step size and so on. Submit an issue on the ferrn GitHub for the explore_space_pcr() problem. The smaller value problem might not be an issue; check the trace plots on the scagnostics to see what happens. Also make sure to save the tours. It is better to start a bunch of tours, let them run, save them, and watch them later.
4) New things: make different data sets with noise, two different features in different groups, and also two different features in the same group.

harriet-mason commented 2 years ago

Ok I might still do some more work tomorrow, but I'm running into a couple of blocks so I thought I would give an update now just in case.

Just a note, I might add my housemate Tom to this GitHub repo to run some tours overnight on his desktop. Running the tours overnight on my laptop (and now after running stuff all last night and today) is slowly killing it and it's running super slow.
I have three simulated datasets and all have only 5 variables to keep them small for the time being:
a) Feature vs Noise: Group A: L-shape feature on two variables and noise on the others, Group B: all noise
b) Feature vs Feature: Group A: L-shape feature on two variables and noise on the others, Group B: nonlinear feature on the same two variables and noise on the others
c) Multiple Features vs Noise: Group A: L-shape feature on two variables, nonlinear feature on another two variables and the rest are noise, Group B: all noise
I have checked the scagnostics for all of them, and each data set has some scagnostics that can recognise the feature group. The tour projection pursuit has only been done on Feature vs Noise though.
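
For reference, a minimal base R sketch of how a data set like (a) Feature vs Noise could be simulated. The L-shape construction, the sample size, the seed and all object names are assumptions for illustration; the actual simulation code is not shown in this thread.

```r
set.seed(2021)  # arbitrary seed, for reproducibility only

# Hypothetical L-shape: half the points run along one axis, half along the other
l_shape <- function(n, jitter = 0.05) {
  on_x  <- rep(c(TRUE, FALSE), length.out = n)
  long  <- runif(n)                      # position along an arm of the L
  short <- abs(rnorm(n, sd = jitter))    # small spread away from the arm
  cbind(ifelse(on_x, long, short), ifelse(on_x, short, long))
}

noise <- function(n, k) matrix(rnorm(n * k), nrow = n)

n <- 200  # observations per group (assumed)
p <- 5    # five variables, as in the comment above

group_a <- cbind(l_shape(n), noise(n, p - 2))  # feature on x1, x2; noise on x3 to x5
group_b <- noise(n, p)                         # all noise

feature_vs_noise <- data.frame(rbind(group_a, group_b))
names(feature_vs_noise) <- paste0("x", seq_len(p))
feature_vs_noise$group <- rep(c("A", "B"), each = n)
```

The feature and noise columns are on different scales here, so the variables would probably need standardising before being passed to a tour.
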
I ran a big grid of tours for the Feature vs Noise simulated data and some values got pretty close, so with polish, I think it should work well. I have been scoring the tours when I watch them in this Google Sheet: https://docs.google.com/spreadsheets/d/1KfVjqpNiHhPhYmz7hzXZZQUjSL-MnuX4UeX7RIRv33I/edit?usp=sharing So far I have only done a range of values for convex and skinny, but I'll probably run the same for outlying and monotonic (although I don't think monotonic is rotation invariant, so I'm not sure how that will go).
I haven't been able to get any trace plots. I have now been saving the tour to watch later with save_history (instead of saving the animate_xy data frame), but when I try to make the animation data frame using animate_xy and planned_tour(), the data frame is null. The only way I can get the PP object that is used in the ferrn package is by saving the animation with the guided tour as an object. Is this normal or have I messed something up?
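
One possible workaround for the missing trace plots, sketched in base R: recompute the index value of every saved basis directly. This assumes the saved history is a p x d x n array of projection bases and that the index is a function taking a projected matrix and returning one number. `trace_index`, `toy_index` and the random stand-in inputs are hypothetical names made up for this sketch; tourr also appears to ship a `path_index()` helper for the same job, so check its documentation before relying on this version.

```r
# Hypothetical helper: index value of every saved basis.
# `bases` is a p x d x n array, `data` an n_obs x p numeric matrix,
# `index_f` any function that maps a projected matrix to one number.
trace_index <- function(bases, data, index_f) {
  bases <- unclass(bases)  # drop any S3 class so plain array indexing applies
  data  <- as.matrix(data)
  p     <- dim(bases)[1]
  vapply(
    seq_len(dim(bases)[3]),
    function(i) index_f(data %*% matrix(bases[, , i], nrow = p)),
    numeric(1)
  )
}

# Stand-in inputs so the sketch runs without tourr
set.seed(1)
data <- matrix(rnorm(100 * 5), ncol = 5)
random_basis <- function(p, d) qr.Q(qr(matrix(rnorm(p * d), p, d)))
bases <- array(replicate(10, random_basis(5, 2)), dim = c(5, 2, 10))
toy_index <- function(mat) abs(cor(mat[, 1], mat[, 2]))  # placeholder index

trace <- trace_index(bases, data, toy_index)
plot(trace, type = "b", xlab = "basis", ylab = "index value")
```

This gives a basic trace plot only. It does not rebuild the full optimisation record that the ferrn plots expect.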

harriet-mason commented 2 years ago

Ok so I have a couple of extra comments before today's meeting.

1) A lot of the scagnostics are inconsistent at finding the shape in the projection pursuit across values of alpha and different seeds. Some values of alpha seem to perform better than others, but even those are reasonably inconsistent. Since I have only changed alpha for these tours, changing max.tries could make them more consistent, but if I'm honest I'm not sure what max.tries does exactly. From the tour documentation, I thought it was the number of bases it checked around the current one and that its default was infinite, so decreasing max.tries would decrease the search space and decrease the chance of finding the max. So yeah idk what it does haha.

2) For some reason a few of the tours didn't save properly. It's not a major problem, but it is odd that seemingly random tours did not save. The file is there, but when I read it the R object is null.
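
A small base R sketch that could help narrow down the null saves: refuse to write a NULL object, and read each file back straight after saving. `save_checked`, the file naming scheme, the alpha and seed values, and the stand-in `tour_result` are all hypothetical; this does not diagnose why the original files were null.

```r
# Hypothetical helper: save an object, read it back, and fail loudly if it is NULL
save_checked <- function(obj, path) {
  if (is.null(obj)) stop("refusing to save a NULL object to ", path)
  saveRDS(obj, path)
  back <- readRDS(path)
  if (is.null(back)) stop("saved file reads back as NULL: ", path)
  invisible(path)
}

# File names that encode the settings (the naming scheme is made up)
grid <- expand.grid(alpha = c(0.3, 0.5, 0.7), seed = 1:2)
grid$file <- file.path(
  tempdir(),
  sprintf("convex_alpha%.1f_seed%d.rds", grid$alpha, grid$seed)
)

for (i in seq_len(nrow(grid))) {
  # Stand-in for the object returned by a tour run with these settings
  tour_result <- list(alpha = grid$alpha[i], seed = grid$seed[i])
  save_checked(tour_result, grid$file[i])
}
```

If the error fires at the first check, the tour call itself returned NULL, so the problem is upstream of the file writing.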

3) As a general assessment of the scagnostics I've tried: convex and skinny worked pretty well despite the inconsistency around the alpha values. Outlying worked sometimes but I think because it is one of the noisier scagnostics it struggled regardless of the alpha value. Adding in robust outlying when I fix clumpy2 could be a later project. After trying the tour on monotonic, I'm pretty sure it isn't rotationally invariant.

4) I tried some tours on the Feature vs Feature data, which worked pretty well on the handful of splines tours I checked. I'm not sure if that is because splines is not as noisy as the other scagnostics, or if it is because the Feature vs Feature problem is fundamentally easier for the scagnostics to see than Feature vs Noise. The Feature vs Feature tour log is here

harriet-mason commented 2 years ago

Hey so I have done a couple of things but I might not do other work before the meeting (I'm not sure what I can do).

Just a note that I still can't make the trace plots, because I don't know what is causing the difference between the save_history and the animate_xy functions.
I have tested a number of combinations of alpha and max.tries, but it was rare that the shape was found every time. There is a table in the Google Sheet that summarises this. Skinny found the shape every time with alpha=0.7 and max.tries>=100; the best case for convex was 3 out of 5 times with alpha=0.3 and max.tries>=200.
I added a polish search to some of the tours and it cleans the shape easily. I'm not sure what I should do with the polish. It cleans up the tours that almost find the shape every time, but should I run it over every tour in the grid search to see how much it can fix the ones that aren't close?
I'm now not sure what I could do next. I could repeat this process (although more efficiently) with the feature_vs_feature dataset and the multiplefeatures_vs_noise data set, but I feel like it will have a similar outcome to the feature_vs_noise data I have already done. I also could try different features to utilise different scagnostics, but the remaining scagnostics either have that MST binning problem (sparse or skewed) that makes the range of values super small, or they are the adjusted scagnostics (striated2 and clumpy2), which are either too computationally expensive (although they might not be on Tom's computer) or probably won't gradually increase in an intuitive way. On this note, I could try to work out some of the issues that need to be fixed to use the remaining scagnostics as PP indexes. I also thought I should transfer the current PP index function to the cassowaryr package, but that would only take 30 mins haha. I also considered changing the number of noise variables to see how that affects the ability to find the shape.

harriet-mason / 2021_Summer_Research
