This is how this situation happened (and it is quite easy to reproduce). In my case, it happened because I had accidentally deleted my working copy and had to restore from an older backup. So this is a very real scenario.
@vazexqi said:
Did you anticipate such a scenario? It seems quite possible, especially if the user takes daily backups to his own HDD (not to CVS/SVN) and restores from a backup when things go wrong (i.e., the user is effectively reverting).
Can we solve this by always forcing an `svn update` on the edu.illinois.codingspectator.data directory before anything else loads? That way, even if the user restores from a backup, the first time Eclipse starts up it will load the latest data available on the server (via `svn update`) and ensure that the restored backup reflects the newer changes. The consequence is that the CodingTracker data restored locally might contain fewer changes than what was previously recorded on the server.
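For concreteness, a minimal sketch of such a startup update using SVNKit (the SVN library discussed later in this thread); `StartupUpdater` and `updateWatchedDirectory` are hypothetical names, and the sketch assumes the watched directory is already a checked-out working copy with cached credentials:

```java
import java.io.File;

import org.tmatesoft.svn.core.SVNDepth;
import org.tmatesoft.svn.core.SVNException;
import org.tmatesoft.svn.core.wc.SVNClientManager;
import org.tmatesoft.svn.core.wc.SVNRevision;

public class StartupUpdater {

    // Bring the watched data directory up to HEAD before any recorder
    // writes to it, so a restored backup picks up the server's newer data.
    public static void updateWatchedDirectory(File watchedDir) throws SVNException {
        SVNClientManager manager = SVNClientManager.newInstance();
        try {
            manager.getUpdateClient().doUpdate(watchedDir, SVNRevision.HEAD,
                    SVNDepth.INFINITY, false /* allowUnversionedObstructions */,
                    false /* depthIsSticky */);
        } finally {
            manager.dispose();
        }
    }
}
```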
@Wanderer777 said:
Thank you for letting me know about this scenario. No, I did not think about such a scenario, but apart from the update problem (the SVN exception), it would not break the data collected by CodingTracker. Some part of the data would simply be missing, but that would naturally reflect the fact that the developer lost the corresponding part of his/her changes to the code.
So, I do not see a problem if we lose some data, as long as the recorded operation sequence corresponds to the actual code (i.e. we can replay the collected data contiguously). A really bad scenario would be to get some data that does not match the actual state of a developer's workspace after the restore. This would essentially break an operation sequence in the middle, making it unreplayable.
@vazexqi and @Wanderer777:
How can CodingSpectator ensure that the `svn update` operation happens before any writes to the watched directory?
@vazexqi and @Wanderer777:
An intrusive way to resolve this issue is to avoid writing directly into the watched directory. We can make every data writer write to a location outside the watched directory and only copy the data into the watched directory during `preSubmit`. This would allow us to perform an `svn update` operation before `preSubmit`, i.e. before changing the state of the watched directory. As I said, this approach is intrusive because it requires changing the way many data writers work. So, I suggest invoking the `svn update` operation before anything is written into the watched directory. It's crucial that `svn update` be invoked before the state of the watched directory changes; otherwise, conflicts may happen. At the moment, I don't know of a good place to invoke the `svn update` operation.
@Wanderer777 and I discussed this issue today. @Wanderer777 explained to me that it isn't possible to do an `svn update` operation before anything is written into the watched directory, because `svn update` requires authentication and the user could perform some actions before the authentication completes. Therefore, we're going to implement the harder solution: we will use a directory other than the watched directory as the main storage and copy the data over to the watched directory after an `svn update` operation and before uploading the data. @Wanderer777 is going to update `SafeRecorder` to use a directory other than the watched directory for its initial storage. @vazexqi and I will have to move CodingSpectator's refactoring logs out of the watched directory and copy them over to the watched directory just before the upload. This will result in a mechanism similar to the one we employ for the refactorings captured by Eclipse.
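A rough sketch of the copy step using plain `java.nio`; `PreSubmitCopier` and both directory parameters are hypothetical, and the real logic would live in `SafeRecorder`:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

public class PreSubmitCopier {

    // After a successful svn update, mirror the private storage directory
    // into the watched working copy so the next commit appends the buffered
    // data on top of the server's latest revision.
    public static void copyStorageToWatched(Path storageDir, Path watchedDir)
            throws IOException {
        try (Stream<Path> paths = Files.walk(storageDir)) {
            for (Path source : (Iterable<Path>) paths::iterator) {
                Path target = watchedDir.resolve(storageDir.relativize(source));
                if (Files.isDirectory(source)) {
                    Files.createDirectories(target);
                } else {
                    Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }
}
```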
Let's make `.metadata/.plugins/edu.illinois.codingspectator.data/watched` the path to the SVN working copy.
@reprogrammer and @vazexqi:
I am somewhat concerned about the copying step of our tentative solution. The file that keeps the data recorded by CodingTracker can grow substantially during the study time frame and may reach several hundred megabytes by the end of the study. Copying such files will be quite time consuming. So, I wonder whether it is worth solving a potential problem for 1% of participants while the other 99% suffer some overhead.
I am wondering whether there is a way to avoid copying our files. The problem essentially arises when SVN sees that a user who previously uploaded, say, version v2 now tries to upload version v1 (after the restore) plus some changes. I wonder how much control we have over SVN's behavior through SVNKit. For example, would we be able to trick SVN into believing that it is going to submit v2 + changes2 rather than v1 + changes1? That is, instead of doing an update, we could try to modify the local SVN files to contain the latest uploaded version number (and maybe the latest uploaded file content as well, if SVN uses it to compute the diff), which we could request from the server before each upload.
I've opened issue #287 for storing the descriptors captured by CodingSpectator outside the watched folder.
@Wanderer777: How does `SafeRecorder` version the log files? Let's say CodingSpectator version `x` writes something into `refactoring-problems.log` using `SafeRecorder`, but the user doesn't submit the data until he/she upgrades to CodingSpectator version `y`. Where will the write to `refactoring-problems.log` end up: `x/refactoring-problems.log` or `y/refactoring-problems.log`?
@reprogrammer: This is a very important concern; thank you for raising it. `SafeRecorder` always writes files to the directory of the current version of CodingSpectator. Then, before the commit, it copies the recorded files of the current version to the current version's directory inside the watched directory. As a result, in the scenario that you described above, the content of `x/refactoring-problems.log` will never get uploaded. To avoid this loss of data, we might need to make `SafeRecorder` look for files that remained from previous versions and copy them to their corresponding version directories inside the watched directory.
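A sketch of that leftover scan, assuming a layout like `<version>/<subfolder>/<record file>` (matching paths such as `1.0.0.qualifier/refactorings/refactoring-problems.log` quoted later in this thread); all names here are hypothetical:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

public class LeftoverMigrator {

    // Record files left under older version directories (e.g.
    // x/refactoring-problems.log) are copied to the same version directory
    // inside the watched folder so they still get uploaded after the user
    // upgrades to version y.
    public static void copyLeftovers(Path storageRoot, Path watchedRoot,
            String currentVersion) throws IOException {
        try (DirectoryStream<Path> versionDirs = Files.newDirectoryStream(storageRoot)) {
            for (Path versionDir : versionDirs) {
                // The current version is handled by the normal pre-commit copy;
                // only pick up directories left behind by older versions.
                if (!Files.isDirectory(versionDir)
                        || versionDir.getFileName().toString().equals(currentVersion)) {
                    continue;
                }
                try (Stream<Path> files = Files.walk(versionDir)) {
                    for (Path source : (Iterable<Path>) files::iterator) {
                        if (Files.isRegularFile(source)) {
                            Path target = watchedRoot.resolve(storageRoot.relativize(source));
                            Files.createDirectories(target.getParent());
                            Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
                        }
                    }
                }
            }
        }
    }
}
```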
@reprogrammer: Please review my last two commits, in particular the changes to the class `org.eclipse.epp.usagedata.internal.recording.uploading.UploadManager`, where I merged the former `preLock` and `preSubmit` methods.
@Wanderer777: The two commits bc9f8b6c46e8f7c5f755c130a58f746d893980fc and b584680f16c631d909c0d11e22c0a3c131022ed3 look good to me. Thank you.
@Wanderer777: I performed a few steps to test the new `SafeRecorder`. When I inspected the `.metadata/.plugins` folder, I noticed that `codechanges.txt` had been moved from `.metadata/.plugins/edu.illinois.codingspectator.saferecorder/1.0.0.qualifier/codingtracker/codechanges.txt` to `.metadata/.plugins/edu.illinois.codingspectator.data/1.0.0.qualifier/codingtracker/codechanges.txt`. But `refactoring-problems.log` was still at `.metadata/.plugins/edu.illinois.codingspectator.saferecorder/1.0.0.qualifier/refactorings/refactoring-problems.log`. Do you know why `refactoring-problems.log` didn't get moved to the watched folder?
@Wanderer777:
CodingSpectator checks out the latest version of the participant's folder after the participant successfully authenticates (see `Submitter#authenticateAndInitialize`). If a working copy does not exist, this command checks one out; if a working copy exists, the command updates it. It's good that CodingSpectator updates the working copy right after authentication and before anything new is written to the watched folder. But there are still cases in which a conflict may occur.
If the participant reverts the workspace to an older revision and manually modifies a file in the watched folder, CodingSpectator will report an SVN conflict the next time it tries to upload its data.
There are two possible strategies for resolving this conflict: we can have the SVN client resolve it automatically by taking either the local or the remote copy of the conflicting file. I'd rather take the remote one, since that's more recent. Which one do you prefer?
@reprogrammer: I would also prefer to use the remote copy. The local copy would contain manual modifications, which essentially break our log: if we take it instead of the remote copy and then append to it whatever was accumulated in the storage location, we would most probably get an unreplayable sequence. So, taking the remote copy is safer, although if the user manually modifies a log file in the storage location, the recorded sequence will be broken anyway.
@Wanderer777: As I described in my previous comment, there is a possibility of conflict when updating a working copy. Therefore, we have to be prepared to deal with such conflicts.
I changed directory to `.metadata/.plugins/edu.illinois.codingspectator.data` and executed the `svn up` command. As a result, I got a conflict. Then, I issued the command `svn resolve --accept=theirs-full -R .`, which resolved all the conflicts in the working copy using the "theirs-full" strategy.
Fortunately, the `SVNWCClient#doResolve` method in SVNKit supports the `svn resolve` command. Now that I've found a way to programmatically resolve the conflicts, we no longer need to move our log files outside the watched folder. In other words, we can always do an `svn update` followed by `svn resolve --accept=mine-full -R .`.
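A sketch of this update-then-resolve sequence via SVNKit, mirroring the two commands above; `ConflictResolver` is a hypothetical name:

```java
import java.io.File;

import org.tmatesoft.svn.core.SVNDepth;
import org.tmatesoft.svn.core.SVNException;
import org.tmatesoft.svn.core.wc.SVNClientManager;
import org.tmatesoft.svn.core.wc.SVNConflictChoice;
import org.tmatesoft.svn.core.wc.SVNRevision;

public class ConflictResolver {

    // Equivalent to `svn update` followed by
    // `svn resolve --accept=mine-full -R .` on the watched working copy.
    public static void updateAndResolve(File watchedDir) throws SVNException {
        SVNClientManager manager = SVNClientManager.newInstance();
        try {
            manager.getUpdateClient().doUpdate(watchedDir, SVNRevision.HEAD,
                    SVNDepth.INFINITY, false, false);
            manager.getWCClient().doResolve(watchedDir, SVNDepth.INFINITY,
                    SVNConflictChoice.MINE_FULL);
        } finally {
            manager.dispose();
        }
    }
}
```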
We should discuss the advantages and disadvantages of buffering the writes to the watched folder.
@Wanderer777: I prefer to see all the data in the latest revisions committed to the repository. This frees us from the burden of exploring the history of the SVN repository. So, I'd still prefer to use the "THEIRS_FULL" strategy to resolve the conflicts (see 76b58b7bb44376e22f32c974d4187a201c699016). The combination of the "THEIRS_FULL" strategy and not writing to the watched folder before updating the working copy makes it possible to retain the existing data on the repository and append the new data to it. I like this approach because all of the data that CodingSpectator captures will exist in the latest revision of the repository. I think this justifies buffering the writes to the watched folder until the working copy is updated and its conflicts are resolved. What do you think?
I tested my SVN conflict resolution mechanism with the authentication information obtained from either the secure storage or the dialog.
@reprogrammer: I agree with you that seeing all the data in the latest repository revisions is a good motivation to buffer the writes to the watched folder. At the same time, if CodingTracker's main record file goes through such a scenario, the resulting operation sequence would most probably not be replayable. But if we use the "MINE_FULL" strategy, although we would lose some data due to the workspace restoration, the resulting CodingTracker sequence would be replayable (unless, of course, the participant modified it manually). Also, this strategy makes buffering the writes to the watched folder redundant.
@reprogrammer:
In response to your previous comment: `SafeRecorder` treats all record files uniformly, which means it should not have problems specific to particular record files. Also, I tried it with multiple record files at once (e.g. `codechanges.txt` and `error.log`) and it worked fine for me. Please note that `SafeRecorder` will not delete a storage file unless both its reading and its writing completed successfully (otherwise, the file is preserved for the next upload). In your scenario, `refactoring-problems.log` did not disappear from the storage location, but did it appear in the watched directory? Also, did you notice any problems in the error log due to file reading/writing? If you upload again, is `refactoring-problems.log` moved this time?
@Wanderer777: As you said, the "MINE_FULL" strategy will keep the CodingTracker sequence replayable. I'm fine with using the "MINE_FULL" strategy. Should we go ahead and revert all of our changes that buffer the writes into the watched folder then?
@reprogrammer:
It looks like buffering the writes into the watched folder makes CodingSpectator more robust (see issue #308), so let's keep it unless we find that it is problematic or truly redundant.
It also looks like there is a scenario in which even the "MINE_FULL" strategy would not keep the CodingTracker sequence replayable. Let's say the latest restoration point of Eclipse coincides with the latest upload. In that case, there will not be any conflicts during `svn update`, and the locally accumulated operations will be appended to the server's version of the file (which does not reflect the current state of the workspace); therefore, the resulting operation sequence would not be replayable.
We need to make sure that the user's record file is always the one we consider for upload. I do not know whether there is an SVN command that achieves this, but we could try to simulate the conflict ourselves by always writing something into the file before `svn update`. Or we could rename the watched file before performing `svn update`, then perform the update, delete the updated file, and rename the original file back, thus preserving the user's operation sequence for the upload. What do you think?
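A sketch of the rename-around-update idea, with a hypothetical `RecordFilePreserver` and the `svn update` step abstracted as a callback; error handling and SVN bookkeeping are glossed over:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class RecordFilePreserver {

    // Move the user's record file out of the way, run `svn update`, discard
    // whatever the update brought in, and move the original back, so the
    // user's operation sequence is the one considered for upload.
    public static void preserveAcrossUpdate(Path recordFile, Runnable svnUpdate)
            throws IOException {
        Path aside = recordFile.resolveSibling(recordFile.getFileName() + ".preserved");
        Files.move(recordFile, aside, StandardCopyOption.REPLACE_EXISTING);
        try {
            svnUpdate.run();                  // the update restores the server's copy
            Files.deleteIfExists(recordFile); // drop the server's copy
        } finally {
            Files.move(aside, recordFile, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```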
@Wanderer777: Good catch! I had missed this scenario. The solution to this problem should apply to all files in the watched folder, not just the CodingTracker log.
@Wanderer777, @vazexqi:
The issues (#257 and #309) we have run into make me wonder whether we are doing something fundamentally wrong. I guess it wasn't a good idea to rely on the version history of Subversion in the first place. How about storing every upload in a new location? What if we store the data that is uploaded at time `t` at `<Subversion Repository>/<username>/<UUID>/<t>`? I think if we commit the data into a fresh path, Subversion will take the data from the working copy without raising any conflicts. Do you know whether this approach would be too expensive? In other words, which is more expensive: saving the revisions in the history of Subversion, or saving them at different paths?
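For illustration, the fresh location could be derived with SVNKit's `SVNURL` like this (a sketch; `UploadPaths` and its parameters are hypothetical):

```java
import org.tmatesoft.svn.core.SVNException;
import org.tmatesoft.svn.core.SVNURL;

public class UploadPaths {

    // Derive a fresh commit location of the form
    // <Subversion Repository>/<username>/<UUID>/<t>, so each upload lands
    // in a path Subversion has never seen and no conflicts can arise.
    public static SVNURL freshUploadLocation(SVNURL repositoryRoot, String username,
            String uuid, long uploadTimeMillis) throws SVNException {
        return repositoryRoot.appendPath(username, false)
                .appendPath(uuid, false)
                .appendPath(Long.toString(uploadTimeMillis), false);
    }
}
```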
@reprogrammer: I think that storing each upload in a separate path is a good idea. This would allow us to avoid any SVN problems and get rid of the long-running (and thus error-prone) moving operation in `SafeRecorder`, while storing the data in a way that is easy to access and process. I would assume that storing each revision in a separate path is more expensive, but we could implement some kind of daemon process on the server side that merges the separate paths into a single path with multiple revisions. What concerns me is the time it would take a participant to upload his/her data to the server; at the end of the study, the main log of CodingTracker could be really big. Or do you suggest uploading only deltas (i.e. deleting the uploaded content after a successful upload)?
@Wanderer777: I wouldn't worry that much about the space overhead on the server side. But, as you mentioned, if we commit the data into a new path, we will have to transfer the whole data set rather than just the delta. This large data transfer will hurt the user experience, especially if it's frequent. So, we have to think of ways to minimize the overhead of the data transfer. We don't have to commit the data into a fresh location in every submission. How can we detect when we really need to commit to a fresh path? Is it sufficient to compare the revision of the local working copy with that of the remote one?
@reprogrammer: I think we can indeed compare the local revision number with the server revision number to decide whether we need to upload the data to a new path. We know for sure that if the revisions are the same, we can upload the data without a risk of conflict. If they differ, we start uploading to a new path. Such a check has to be done for all files, since a single file with mismatched revisions is sufficient to produce a conflict. I would expect that switching to a new path would be needed only for the few participants who restore their Eclipse workspaces. Those participants will experience a longer upload the first time we start writing to a new path. Also, it looks like this way we would not need to buffer files from the watched directory.
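A sketch of that revision check with SVNKit, comparing the working copy's committed revision against the server's for the same path; `FreshPathDecider` is a hypothetical name, and a real implementation would run this per file:

```java
import java.io.File;

import org.tmatesoft.svn.core.SVNException;
import org.tmatesoft.svn.core.wc.SVNClientManager;
import org.tmatesoft.svn.core.wc.SVNInfo;
import org.tmatesoft.svn.core.wc.SVNRevision;

public class FreshPathDecider {

    // Compare the working copy's last committed revision with the server's
    // last-changed revision of the same path; a mismatch suggests the local
    // copy is behind (e.g. restored from a backup), so commit to a fresh path.
    public static boolean needsFreshPath(File workingCopy) throws SVNException {
        SVNClientManager manager = SVNClientManager.newInstance();
        try {
            SVNInfo local = manager.getWCClient().doInfo(workingCopy, SVNRevision.WORKING);
            SVNInfo remote = manager.getWCClient().doInfo(local.getURL(),
                    SVNRevision.HEAD, SVNRevision.HEAD);
            return local.getCommittedRevision().getNumber()
                    != remote.getCommittedRevision().getNumber();
        } finally {
            manager.dispose();
        }
    }
}
```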
@Wanderer777: There are a few questions we need to answer to better understand the pros and cons of committing to fresh locations. I've opened issue #311 to discuss them.