Status: Closed (by GoogleCodeExporter)
If caching is not working for us, let's strip it out. But it seems counterintuitive
to me that a client in repeated use can repeatedly download the same
files from the server faster than it can read them from a local drive. It
also seems intuitive that the lower the bandwidth, the more effective caching would
be. It also seems clear that the first time a file is acquired from the server
there will be a penalty in writing the file to disk. Any idea what's going on
that produces these results?
Original comment by davlit0...@gmail.com
on 30 Jul 2010 at 5:06
data.wholebraincatalog.org is not running the server version that supports
caching.
Original comment by davlit0...@gmail.com
on 30 Jul 2010 at 5:18
I think there may be some confusion. When "client caching" is mentioned it is
referring to the user client, Whole Brain Catalog 0.7.7.x
I don't see how the server has any role in that client side caching.
Original comment by caprea
on 30 Jul 2010 at 6:40
The server role in client-side caching is to provide information as to the
currency of data files so that the client can verify that its cached version of
a data file is the correct version. The server does this by providing the SHA-1
value of the current data file version in the data wrapper.
It's a separate issue but the server is caching the entire database in its
memory.
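The validation step described above can be sketched as follows. This is a hypothetical illustration, not code from the WBC repository; the class and method names are invented:

```java
public class CacheValidator {
    // Hypothetical sketch: the server ships the SHA-1 of the current file
    // version in the data wrapper; the client compares it against the hash
    // of its locally cached copy and re-downloads on a mismatch.
    public static boolean isCacheCurrent(String serverSha1, String cachedSha1) {
        return serverSha1 != null && serverSha1.equalsIgnoreCase(cachedSha1);
    }
}
```

The comparison is case-insensitive because hex digests are commonly emitted in either case.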
Original comment by davlit0...@gmail.com
on 31 Jul 2010 at 4:18
Updating the issue with some text from some email threads:
From Chris:
Here's a summary of the 'Download Manager'.
First, a bit of background:
DataWrappers have the method .getData(), which is called at various
points; the data is NOT assumed to be on the client. If it IS, it is
fetched from the MemoryCache; if not, the client uses the restlet to retrieve
the data before continuing (a blocking operation, so the client freezes).
This is important because, for now, the download manager only affects startup.
And if it fails, the restlet getData() will take over.
DownloadManager:
http://code.google.com/p/wholebrain/source/browse/wbc/trunk/wbc-client/src/main/java/org/wholebrainproject/wbc/data/DownloadManager.java#365
The main execution is in the 'run' method loop. If the data is not currently
in the MemoryCache, it'll check the disk for an appropriate file using
this DiskCache class:
http://code.google.com/p/wholebrain/source/browse/wbc/trunk/wbc-client/src/main/java/org/wholebrainproject/wbc/data/DiskCacheRepository.java
If the file is found, it computes the SHA-1 for that file and compares
it. This is a potential spot for improvement, but I think it is
unlikely to have a large impact.
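The hash-and-compare step described above looks roughly like the following. This is an illustrative sketch of a standard SHA-1 computation with `java.security.MessageDigest`, not the actual DiskCacheRepository code:

```java
import java.io.InputStream;
import java.security.MessageDigest;

public class Sha1Check {
    // Hash a stream of bytes (e.g. a cached file's contents) and return the
    // lowercase hex SHA-1, suitable for comparison against the server's value.
    public static String sha1Hex(InputStream in) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            md.update(buf, 0, n);  // feed the file in buffered chunks
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

Hashing in buffered chunks keeps memory use flat regardless of file size, which matters for large mesh files.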
If the file is not found, it downloads the file contents using the method
primitiveDownload:
http://code.google.com/p/wholebrain/source/browse/wbc/trunk/wbc-client/src/main/java/org/wholebrainproject/wbc/data/DownloadManager.java#125
It then downloads the content, byte by byte. When it has finished
downloading the content, it reads the content of the file back in and
converts it to JME. I think this is the best candidate for improvement,
for two reasons:
1) Our restlet library might be using multiple connections at once; I
only use one.
2) It reads from net > writes to disk > reads from disk > converts the
file > places it in the memory cache.
Instead, we could convert the data before writing it to disk, thus
saving the operation of reading it back from the file.
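The reordering suggested in point 2 might look roughly like this. All names here are illustrative, and `convert()` is a placeholder for the real OBJ/XML-to-JME importer:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class DownloadOnce {
    // Read the whole stream into memory in buffered chunks
    // (rather than byte by byte).
    public static byte[] readAll(InputStream in) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    // Sketch: download once, convert from the in-memory copy, then persist
    // the raw bytes, avoiding the write-to-disk -> read-from-disk -> convert
    // round trip described in the comment above.
    public static Object downloadAndConvert(InputStream net, Path cacheFile) throws Exception {
        byte[] raw = readAll(net);
        Object converted = convert(new ByteArrayInputStream(raw)); // placeholder: OBJ/XML -> JME
        Files.write(cacheFile, raw); // cache the raw bytes for the next run
        return converted;
    }

    // Placeholder for the real importer; here it just echoes the bytes.
    static Object convert(InputStream in) throws Exception {
        return readAll(in);
    }
}
```

The trade-off is that the whole file must fit in memory during conversion, which is usually already true when the converted object is going into the MemoryCache anyway.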
Original comment by stephen....@gmail.com
on 12 Aug 2010 at 3:17
From Chris:
I did a test to check whether computing the SHA-1s on the client was having
much of an impact. To do this, I modified a conditional statement from
"if checksums are the same" to "if the file exists at all",
effectively ignoring checksums and assuming that, if there was a file,
it was the right file. I found no significant improvement in startup time,
although I did not time this accurately. Yes, storing the SHA-1 is more
efficient, but this is not where the slowness is coming from.
Original comment by stephen....@gmail.com
on 12 Aug 2010 at 3:17
From Chris:
I did a controlled test on this. Here are the results, I used dev-data
server for all these tests.
Here is a measure of startup time in different scenarios. I did
multiple runs for each scenario and recorded the time. All times are
in seconds.
Scenario 1 (cache flag is off): 43, 38, 35
Scenario 2 (NO content is on disk; the cache is empty): 87, 76
Scenario 3 (client calculates checksum of files on disk): 68, 66, 71
Scenario 4 (client does NOT calculate checksum; assumed to be equal): 60, 65, 61
This demonstrates that calculating the SHA-1 on the client each time costs
~5 seconds, which is minimal compared to the biggest deficit (~28s),
which seems to come from using caching at all. I can't attribute all
of this loss to the download routine, because even in the scenario where
all files are on disk, it still takes longer.
Original comment by stephen....@gmail.com
on 12 Aug 2010 at 3:17
How is this looking after r2859 ?
Original comment by stephen....@gmail.com
on 12 Aug 2010 at 3:20
As projected in my previous comment, there has been negligible improvement (but
improvement nevertheless).
Original comment by caprea
on 12 Aug 2010 at 5:53
Ok. What's needed to help you identify the bulk of the slowness, so that you
can focus your time on solving the problem that is causing most of it?
How can I help?
Original comment by stephen....@gmail.com
on 12 Aug 2010 at 5:58
I have reworked the method that actually downloads the content.
The test case that measures download performance used to take 20s;
it now accomplishes the same task in only 5s. The actual download mechanism
is much faster, but the startup time is not improved. The improved
code is not making simultaneous connections, so there is additional room for
improvement here, but I don't think it's warranted. I believe all the slowness of
loading comes from loading/converting files from their format to JME.
It's no mystery to me: I think the slowest operation is reading the files from
disk and converting them from their format (OBJ, XML) into JME. I think this
method (and its children) is the bottleneck:
http://code.google.com/p/wholebrain/source/browse/wbc/trunk/wbc-client/src/main/java/org/wholebrainproject/wbc/data/DiskCacheRepository.java#94
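A minimal way to confirm a bottleneck like this is to wrap the suspect call in a timer; the per-file CSV in the next comment comes from instrumentation of this kind. This is an illustrative sketch, with invented names:

```java
public class ConvertTimer {
    // Time an arbitrary task in milliseconds, e.g. a single
    // disk-read + OBJ/XML -> JME conversion for one file.
    public static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000L;
    }
}
```

Logging the file name on the same line as its elapsed time (name first, then time) avoids the off-by-one reading noted a couple of comments below.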
Original comment by caprea
on 13 Aug 2010 at 12:08
I attached a CSV spreadsheet from a run of WBC. It lists only the time
it takes for each read/conversion of a file on disk. In total, this accounts
for 41s of startup time.
Original comment by caprea
on 13 Aug 2010 at 12:44
I realized my print statement precedes the file name, so all those values
should be shifted down one. If you're wondering why one file takes so long,
look at the name of the file succeeding the reported time.
Original comment by caprea
on 13 Aug 2010 at 1:37
This sounds like an excellent observation. If I'm following it, it may offer
the prospect of caching the converted JME object instead of the OBJ or XML
file. This would seem doable if the object implements java.io.Serializable. The
cached SHA-1 could continue to be used, except its value would be that of the
OBJ or XML file that exists on the server. (Ignore this comment if I'm off
base.)
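The suggestion above could be sketched with standard Java serialization, as below. Note this is only an illustration of the idea: whether the converted scene objects actually implement java.io.Serializable is an open question (jME also has its own binary export mechanism, which might be the better fit), and the class name here is invented:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class ConvertedObjectCache {
    // Persist an already-converted object so later startups can skip the
    // expensive OBJ/XML -> JME conversion and just deserialize it.
    public static void store(File f, Serializable converted) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(f))) {
            out.writeObject(converted);
        }
    }

    public static Object load(File f) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(f))) {
            return in.readObject();
        }
    }
}
```

As the comment notes, the cached SHA-1 key could stay keyed to the original OBJ/XML file on the server, so cache invalidation would work exactly as it does now.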
Original comment by davlit0...@gmail.com
on 13 Aug 2010 at 1:58
I agree with Dave, and Stephen has mentioned it before. The way I envision it,
we would simply change the DataWrappers to point to .jme files instead of OBJs
or XML and then just work with those. The server would be responsible for
converting, and the user would never see those original files again.
However, this would break the "Download as XML" feature unless we had 'datum'
elements for DataWrappers. Alternatively, we could just point to the original
file in an RDF annotation (it seems like less work this way).
Original comment by caprea
on 13 Aug 2010 at 2:36
Just to tie together some of these points: issue 382 is about having
multiple data elements for DataWrappers. I think we want to do this to avoid
breaking the download features. However, since this is a pretty core feature,
it will take some care to implement issue 382 without introducing bugs.
Chris, would you like to represent the argument against multiple datum elements
for DataWrappers, if nothing else for the sake of discussion?
Original comment by stephen....@gmail.com
on 15 Aug 2010 at 12:11
Also, issue 238 is about transferring files as JME binary from the server.
Original comment by stephen....@gmail.com
on 15 Aug 2010 at 12:12
After running 200+ tests comparing downloading the same file and importing it:
Average time for Disk: 246 ms
Average time for Memory: 248 ms
This gave memory every advantage (it needed to create fewer objects, since the
data was NEVER written to a file). Any difference of this magnitude (~10 ms or
less) could, I feel, be attributed to error from network latency fluctuations.
Original comment by caprea
on 2 Oct 2010 at 6:42
This is not an issue in the web version.
Original comment by caprea
on 15 Dec 2010 at 12:51
Original issue reported on code.google.com by
stephen....@gmail.com
on 28 Jul 2010 at 7:55