Download discussion - Githubissues

bwalsh commented 7 years ago

existing UI

example request

1) the portal uses standard \<a href="...." /> tags for download and manifests. cookies, not auth headers are used for authorization 2) those urls point to the /api/v1/download/, /api/v1/manifests endpoints in dcc-portal-server

high level flow

key: white:completed, orange:inprogress, green:new From Right to Left

release process populates hdfs and mongo [EUL-10]
dcc-download is deployed and configured connections to hdfs & mongo
dcc-portal-server deployment updated to connect to dcc-download
existing /v1/manifest & /v1/download should work
/v1/manifest for exacloud intercepted and resolved by euler.proxy (alternative change dcc-portal-server)
cookie added to dcc-portal-ui so euler.proxy can authorize

downloads

Tentatively, this should 'just work' when [EUL-10] is complete and dcc-download deployed

manifests

Manifests have several formats see here

Although we are blocked on [EUL-10] for downloads I believe we may be able to move forward with manifests.

High level:

As mentioned above there are multiple manifests formats. For exacloud , euler.proxy would intercept that call, use the filter to query the /v1/repository/files endpoint to retrieve file information
Formulate a shell mainfest along the lines of EGA /templates/manifest.ega.sh.template. That manifest would consist of exacloud friendly scp commands as is currently in use

This has several advantages:

the 'big data' of BAML stays where it is
leverages existing exacloud access methods & security to that data

bwalsh commented 7 years ago

Document ongoing work deploying dcc-download-server

Steps thus far:

fork of icgc/dcc-download to ohsu-comp-bio/dcc-download
clone of ohsu-comp-bio/dcc-download to /home/ubuntu/dcc-download on 10.60.60.59 dcc-portal-blue
modify src/main/resources/application.yml

#
 # Development
@@ -66,16 +72,16 @@ spring:
 # Hadoop
 hadoop:
   properties:
-    fs.defaultFS: file:///
+    fs.defaultFS: hdfs://10.60.60.55:8020

 # Mongo
 spring.data.mongodb:
-  uri: mongodb://localhost/dcc-download
+  uri: mongodb://10.60.60.55:27017/dcc-download

 # Spark Job configuration
 job:
   # The inputDir is configured to be used in Eclipse
-  inputDir: ../dcc-download-server/src/test/resources/fixtures/input
+  inputDir: /bwalsh-release

on etl-2, create the expected results of dcc-release

# create new hdfs directory /bwalsh-release
# cp the contents of dcc-download-server/src/test/resources/fixtures/input into it
ubuntu@dcc-etl-2:~$ hdfs dfs -ls /bwalsh-release
Found 4 items
-rw-r--r--   3 ubuntu hadoop         18 2017-02-10 22:47 /bwalsh-release/README.txt
drwxr-xr-x   - ubuntu hadoop          0 2017-02-10 22:47 /bwalsh-release/legacy_releases
drwxr-xr-x   - ubuntu hadoop          0 2017-02-10 22:47 /bwalsh-release/release_20
drwxr-xr-x   - ubuntu hadoop          0 2017-02-10 22:47 /bwalsh-release/release_21

back on dcc-portal-blue, start the download server, with debug logging, using the 'development' profile (above)

$ pwd
/home/ubuntu/dcc-download/dcc-download-server
$ java -Dlogging.config=./src/test/resources/logback-test.xml   -Dspring.profiles.active=development   -jar target/dcc-download-server-4.3.12-SNAPSHOT.jar   --spring.config.location=./src/main/resources/application.yml

To see the running process.

 ps -ef | grep download
ubuntu   17978     1  0 Feb10 ?        00:02:16 java -Dlogging.config=./src/test/resources/logback-test.xml -Dspring.profiles.active=development -jar target/dcc-download-server-4.3.12-SNAPSHOT.jar --spring.config.location=./src/main/resources/application.yml
ubuntu   19256 19239  0 17:48 pts/0    00:00:00 grep download

bwalsh commented 7 years ago

update monday 2/13

Luckily, it appears even lacking a download stanza in it's application.yml, the dcc-portal-server's defaults will point a a connection on localhost:9090.

As a result, we can see the contents of the hdfs file system in the browser.

blocker?

Unfortunately, the process then errors because the mongo database is empty. See dcc-download mongo connection string above.

It appears Mongo DB is empty. null ModelAndView

It is unclear if this is a blocker, I believe this database is used to store ad-hoc inprogress download requests.

2017-02-13 18:00:36,334 [http-nio-9090-exec-1] DEBUG o.s.d.m.c.MongoDbUtils - Getting Mongo Database name=[dcc-download]
2017-02-13 18:00:36,434 [http-nio-9090-exec-1] DEBUG o.s.w.s.m.m.a.RequestResponseBodyMethodProcessor - Written [UP {hdfs=UP {}, mongo=UP {version=3.2.9}, diskSpace=UP {total=42241163264, free=37514747904, threshold=10485760}}] as "application/json" using [org.springframework.http.converter.json.MappingJackson2HttpMessageConverter@6d205aa]
2017-02-13 18:00:36,435 [http-nio-9090-exec-1] DEBUG o.s.w.s.DispatcherServlet - Null ModelAndView returned to DispatcherServlet with name 'dispatcherServlet': assuming HandlerAdapter completed request handling
2017-02-13 18:00:36,435 [http-nio-9090-exec-1] DEBUG o.s.w.s.DispatcherServlet - Successfully completed request
2017-02-13 18:00:36,436 [http-nio-9090-exec-1] DEBUG o.s.b.c.w.OrderedRequestContextFilter - Cleared thread-bound request context: org.apache.catalina.connector.RequestFacade@3500fbc6
2017-02-13 18:00:36,439 [http-nio-9090-exec-1] DEBUG o.a.c.h.Http11NioProtocol - Socket: [org.apache.tomcat.util.net.NioEndpoint$KeyAttachment@34ca92c6:org.apache.tomcat.util.net.NioChannel@1aaa8110:java.nio.channels.SocketChannel[connected local=/127.0.0.1:9090 remote=/127.0.0.1:41796]], Status in: [OPEN_READ], State out: [OPEN]

bwalsh commented 7 years ago

Reading further, I think that the dcc-etl process needs to be called. For example, it seems that this code writes the expected README.txt file

bwalsh commented 7 years ago

Hi.

We've completed the dcc-release pipeline, at least as far as export.

At the same time, we have deployed an instance of dcc-download-server and have it communicating to the HDFS file system. We've manually populated a directory with the contents of dcc-download-server/src/test/resources/fixtures/input. We've configured proxies and can now download data from HDFS successfully. More on this here: https://github.com/ohsu-comp-bio/euler/issues/16

As far as I can tell the dcc-release process is incomplete, in that it does not create a directory structure within HDFS that dcc-download expects ( see https://github.com/ohsu-comp-bio/dcc-download/tree/develop/dcc-download-server#virtual-file-system ).

There is code here in dcc-etl that looks like it might create the expected directory ( https://github.com/icgc-dcc/dcc-etl/blob/3fb472e5b07adf90e925a76465783a8d3424ea19/dcc-etl-client/src/main/scripts/overarch/overarch.sh#L427 ).

Is there any guidance you can share on how to prepare data for handoff between dcc-release and dcc-download.

Thanks very much for reading.

-Brian Walsh

ohsu-comp-bio / euler

Download discussion #16

existing UI

example request

high level flow

downloads

manifests

update monday 2/13

blocker?