Open bwalsh opened 7 years ago
Document ongoing work deploying dcc-download-server
Steps thus far:
Development profile changes to `application.yml` (pointing the server at the real HDFS and Mongo instances instead of the local test defaults):

```diff
@@ -66,16 +72,16 @@ spring:
 # Hadoop
 hadoop:
   properties:
-    fs.defaultFS: file:///
+    fs.defaultFS: hdfs://10.60.60.55:8020
 # Mongo
 spring.data.mongodb:
-  uri: mongodb://localhost/dcc-download
+  uri: mongodb://10.60.60.55:27017/dcc-download
 # Spark Job configuration
 job:
   # The inputDir is configured to be used in Eclipse
-  inputDir: ../dcc-download-server/src/test/resources/fixtures/input
+  inputDir: /bwalsh-release
```
On etl-2, create the expected results of dcc-release:

- create a new HDFS directory, /bwalsh-release
- copy the contents of dcc-download-server/src/test/resources/fixtures/input into it
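The two steps above can be sketched as the following commands, run on a host with HDFS access (the local checkout path is assumed; adjust to wherever dcc-download is cloned):

```shell
# Create the target HDFS directory used by this deployment
hdfs dfs -mkdir -p /bwalsh-release

# Copy the test fixtures from the local dcc-download checkout into HDFS
hdfs dfs -put dcc-download-server/src/test/resources/fixtures/input/* /bwalsh-release/

# Verify the layout matches what dcc-download-server expects
hdfs dfs -ls /bwalsh-release
```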
```
ubuntu@dcc-etl-2:~$ hdfs dfs -ls /bwalsh-release
Found 4 items
-rw-r--r--   3 ubuntu hadoop         18 2017-02-10 22:47 /bwalsh-release/README.txt
drwxr-xr-x   - ubuntu hadoop          0 2017-02-10 22:47 /bwalsh-release/legacy_releases
drwxr-xr-x   - ubuntu hadoop          0 2017-02-10 22:47 /bwalsh-release/release_20
drwxr-xr-x   - ubuntu hadoop          0 2017-02-10 22:47 /bwalsh-release/release_21
```
Back on dcc-portal-blue, start the download server with debug logging, using the 'development' profile (above):
```
$ pwd
/home/ubuntu/dcc-download/dcc-download-server
$ java -Dlogging.config=./src/test/resources/logback-test.xml -Dspring.profiles.active=development -jar target/dcc-download-server-4.3.12-SNAPSHOT.jar --spring.config.location=./src/main/resources/application.yml
```
To verify the process is running:

```
$ ps -ef | grep download
ubuntu   17978     1  0 Feb10 ?        00:02:16 java -Dlogging.config=./src/test/resources/logback-test.xml -Dspring.profiles.active=development -jar target/dcc-download-server-4.3.12-SNAPSHOT.jar --spring.config.location=./src/main/resources/application.yml
ubuntu   19256 19239  0 17:48 pts/0    00:00:00 grep download
```
Luckily, even without a download stanza in its application.yml, the dcc-portal-server's defaults point to a connection on localhost:9090. As a result, we can see the contents of the HDFS file system in the browser.
Unfortunately, the process then errors because the Mongo database is empty (see the dcc-download mongo connection string above); the request ends with a null ModelAndView. It is unclear whether this is a blocker; I believe this database is used to store ad-hoc, in-progress download requests.
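A quick way to confirm the dcc-download database is empty, assuming the mongo shell is installed on a host that can reach 10.60.60.55 (a sketch, not part of the deployment steps above):

```shell
# List the collections in the dcc-download database;
# an empty array confirms nothing has been written yet.
mongo 10.60.60.55:27017/dcc-download --quiet --eval "printjson(db.getCollectionNames())"
```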
```
2017-02-13 18:00:36,334 [http-nio-9090-exec-1] DEBUG o.s.d.m.c.MongoDbUtils - Getting Mongo Database name=[dcc-download]
2017-02-13 18:00:36,434 [http-nio-9090-exec-1] DEBUG o.s.w.s.m.m.a.RequestResponseBodyMethodProcessor - Written [UP {hdfs=UP {}, mongo=UP {version=3.2.9}, diskSpace=UP {total=42241163264, free=37514747904, threshold=10485760}}] as "application/json" using [org.springframework.http.converter.json.MappingJackson2HttpMessageConverter@6d205aa]
2017-02-13 18:00:36,435 [http-nio-9090-exec-1] DEBUG o.s.w.s.DispatcherServlet - Null ModelAndView returned to DispatcherServlet with name 'dispatcherServlet': assuming HandlerAdapter completed request handling
2017-02-13 18:00:36,435 [http-nio-9090-exec-1] DEBUG o.s.w.s.DispatcherServlet - Successfully completed request
2017-02-13 18:00:36,436 [http-nio-9090-exec-1] DEBUG o.s.b.c.w.OrderedRequestContextFilter - Cleared thread-bound request context: org.apache.catalina.connector.RequestFacade@3500fbc6
2017-02-13 18:00:36,439 [http-nio-9090-exec-1] DEBUG o.a.c.h.Http11NioProtocol - Socket: [org.apache.tomcat.util.net.NioEndpoint$KeyAttachment@34ca92c6:org.apache.tomcat.util.net.NioChannel@1aaa8110:java.nio.channels.SocketChannel[connected local=/127.0.0.1:9090 remote=/127.0.0.1:41796]], Status in: [OPEN_READ], State out: [OPEN]
```
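The log above shows the server answering a health check (hdfs, mongo, and diskSpace all UP). Assuming the standard Spring Boot actuator endpoint on the application port (an assumption; the path is not confirmed in this issue), the same check can be reproduced from the shell on dcc-portal-blue:

```shell
# Query the download server's health endpoint (path assumed to be the
# Spring Boot default); expect hdfs, mongo, and diskSpace all "UP".
curl -s http://localhost:9090/health
```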
Reading further, I think the dcc-etl process needs to be invoked. For example, it seems that this code writes the expected README.txt file.
Hi.
We've completed the dcc-release pipeline, at least as far as export.
At the same time, we have deployed an instance of dcc-download-server and have it communicating with the HDFS file system. We've manually populated a directory with the contents of dcc-download-server/src/test/resources/fixtures/input. We've configured proxies and can now download data from HDFS successfully.
More on this here: https://github.com/ohsu-comp-bio/euler/issues/16
As far as I can tell, the dcc-release process is incomplete, in that it does not create the directory structure within HDFS that dcc-download expects (see https://github.com/ohsu-comp-bio/dcc-download/tree/develop/dcc-download-server#virtual-file-system).
There is code here in dcc-etl that looks like it might create the expected directory ( https://github.com/icgc-dcc/dcc-etl/blob/3fb472e5b07adf90e925a76465783a8d3424ea19/dcc-etl-client/src/main/scripts/overarch/overarch.sh#L427 ).
Is there any guidance you can share on how to prepare data for handoff between dcc-release and dcc-download?
Thanks very much for reading.
-Brian Walsh
existing UI
example request
1) The portal uses standard `<a href="....">` tags for downloads and manifests; cookies, not auth headers, are used for authorization.
2) Those URLs point to the /api/v1/download/ and /api/v1/manifests endpoints in dcc-portal-server.
high level flow
Key: white = completed, orange = in progress, green = new. Read from right to left.
downloads
Tentatively, this should 'just work' once [EUL-10] is complete and dcc-download is deployed.
manifests
Manifests have several formats; see here.
Although we are blocked on [EUL-10] for downloads, I believe we may be able to move forward with manifests.
High level:
This has several advantages: