malariagen / pipelines

Pipelines for processing malaria parasite and mosquito genome sequence data.
MIT License
14 stars 13 forks source link

Allow for flexible columns in manifest for vector alignment genotyping pipeline #71

Open tnguyensanger opened 3 years ago

tnguyensanger commented 3 years ago

Addresses #70

NB: this will fail miniwdl validation. But it passes womtool validation and runs on cromwell server v49. I suspect it's because miniwdl doesn't support read_objects() WDL function from WDL v1.0, but not 100% sure.

https://github.com/malariagen/pipelines/blob/5d26c17c70a5edce00c0ad8a65beda841428ec87/pipelines/import-short-read-lanelet-alignment-vector/farm5/ImportShortReadLaneletAlignment.wdl#L36

The preferred approach is to get miniwdl to recognize the WDL as valid, whether that means to modify the WDL or miniwdl settings. Another approach is to switch github actions to use womtool instead of miniwdl for validation. However, womtool syntax validation is not nearly as thorough as miniwdl.

tnguyensanger commented 3 years ago

Surfacing communications with @gbggrant. TL;DR: Try using a Struct instead of Object.

From: George Grant ggrant@broadinstitute.org Date: Thursday, 10 December 2020 at 13:39 To: Thuy Nguyen tn6@sanger.ac.uk Subject: Re: Advice on miniwdl [EXT]

Hi Thuy,

Here's what I got back from Mike Lin (dna@mlin.net):

Hi George, yea we've so far punted on this since the miniwdl project started shortly after the OpenWDL group moved towards deprecating Object. The object{} literal syntax is there only for building structs. It could be finished if it really unlocked some high-value use case, but otherwise we were hoping to get away with this approach. Is there anything insufficient with the newer structs that prevent using them here? (Of course I understand it's not always possible to invest time in fixing what ain't broke)

Not 100% sure I understand what he's suggesting - but it sounds like you should be able to replace Object with a Struct. I'll forward on the email chain to you if you want to follow up directly with him.

George

tnguyensanger commented 3 years ago

I tried replacing Array[Object] with Array[Map[String, String]] and got the following error:

$ miniwdl check ~/gitrepo/pipelines/releases/BatchImportShortReadAlignmentAndGenotyping/BatchImportShortReadAlignmentAndGenotyping_0.1.0.wdl
(/Users/tn6/gitrepo/pipelines/releases/BatchImportShortReadAlignmentAndGenotyping/BatchImportShortReadAlignmentAndGenotyping_0.1.0.wdl Ln 16 Col 1) Failed to import ImportShortReadAlignment.wdl
(ImportShortReadAlignment.wdl Ln 21 Col 1) Failed to import ImportShortReadLaneletAlignment.wdl
(ImportShortReadLaneletAlignment.wdl Ln 36 Col 46) No such function: read_objects
      Array[Map[String, String]] lanelet_infos = read_objects(per_sample_manifest_file)

I also tried using Array[struct]. I have to admit I have little experience with wdl structs, so I’m not sure if I used the correct syntax. However, I still get the same read_objects error.

$  miniwdl check ~/gitrepo/pipelines/releases/BatchImportShortReadAlignmentAndGenotyping/BatchImportShortReadAlignmentAndGenotyping_0.1.0.wdl
(/Users/tn6/gitrepo/pipelines/releases/BatchImportShortReadAlignmentAndGenotyping/BatchImportShortReadAlignmentAndGenotyping_0.1.0.wdl Ln 16 Col 1) Failed to import ImportShortReadAlignment.wdl
(ImportShortReadAlignment.wdl Ln 21 Col 1) Failed to import ImportShortReadLaneletAlignment.wdl
(ImportShortReadLaneletAlignment.wdl Ln 37 Col 35) No such function: read_objects
      Array[Manifest] lanelet_infos = read_objects(per_sample_manifest_file)
                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(ImportShortReadLaneletAlignment.wdl Ln 40 Col 25) Not an array
        String irods_path = lanelet_infos[idx][LANELET_INFO_COLNAME_IRODS_PATH]
                            ^^^^^^^^^^^^^^^^^^

Where Manifest struct is defined in Manifest.wdl:

version 1.0

struct Manifest {
  String sample_id
  String irods_path
}

And imported by ImportShortReadLaneletAlignment.wdl:

import "Manifest.wdl"
tnguyensanger commented 3 years ago

More communications with Mike Lin, developer of Miniwdl. Tl;dr: try miniwdl v0.10.0: https://github.com/chanzuckerberg/miniwdl/releases/tag/v0.10.0

From: Mike Lin dna@mlin.net Date: Monday, 4 January 2021 at 06:45 To: Thuy Nguyen tn6@sanger.ac.uk Cc: George Grant ggrant@broadinstitute.org Subject: Re: Validation error from miniwdl [EXT]

Dear Thuy, the new miniwdl v0.10.0 [github.com] implements read_object() and read_objects() but only to load Map[String,String] (which can in turn be coerced to a struct type) -- it still lacks general Object support. I hope this helps you, any feedback welcome! Mike

On Tue, Dec 15, 2020 at 8:34 PM Thuy Nguyen tn6@sanger.ac.uk wrote: Hi

We quite like miniwdl for wdl validation. It’s easy to use and is very good at catching syntax issues. Adding read_objects() support is most preferable for us. What is your rough estimate for when such support could be available? I realize we have been discussing all of this quite informally. I’m happy to create an official feature request in the github if it’s easier.

Many thanks! Thuy

From: Mike Lin dna@mlin.net Date: Monday, 14 December 2020 at 23:11 To: Thuy Nguyen tn6@sanger.ac.uk Cc: George Grant ggrant@broadinstitute.org Subject: Re: Validation error from miniwdl [EXT]

Dear Thuy, sorry for these problems -- miniwdl's support for Object is incomplete for the reasons I mentioned with George below, and read_object() and read_objects() in particular are absent. We can work with you to converge on a solution though. • If you're open to moving towards structs, read_json() should currently work (but would require the manifest to be formatted as JSON instead of TSV) • We could add read_object() and read_objects() only for the purpose of initializing structs from TSV • Fully completing Object support is the least-preferred option for us for the reasons mentioned below but it's possible if it really unblocks something critical Let me know what you think, any other feedback welcome too!

Hi George, yea we've so far punted on this since the miniwdl project started shortly after the OpenWDL group moved towards deprecating Object. The object{} literal syntax is there only for building structs. It could be finished if it really unlocked some high-value use case, but otherwise we were hoping to get away with this approach. Is there anything insufficient with the newer structs that prevent using them here? (Of course I understand it's not always possible to invest time in fixing what ain't broke)

On Mon, Dec 14, 2020 at 11:50 AM Thuy Nguyen tn6@sanger.ac.uk wrote: Hi Mike

I’m the developer that George was referring to earlier. I hope you and George don’t mind if I contact you directly. I’m wondering if read_objects() from WDL v1.0 is supported by miniwdl?

Our use case is to read in a TSV with a header such that the columns can be in any order. I wasn’t sure of any other way in WDL v1.0 to do this other than Array[Objects] arr_of_row_objects = read_objects(tsv_with_header). However, I’m definitely up for alternatives.

I tried replacing Array[Object] with Array[Map[String, String]] and got the following error:

$ miniwdl check ~/gitrepo/pipelines/releases/BatchImportShortReadAlignmentAndGenotyping/BatchImportShortReadAlignmentAndGenotyping_0.1.0.wdl (/Users/tn6/gitrepo/pipelines/releases/BatchImportShortReadAlignmentAndGenotyping/BatchImportShortReadAlignmentAndGenotyping_0.1.0.wdl Ln 16 Col 1) Failed to import ImportShortReadAlignment.wdl (ImportShortReadAlignment.wdl Ln 21 Col 1) Failed to import ImportShortReadLaneletAlignment.wdl (ImportShortReadLaneletAlignment.wdl Ln 36 Col 46) No such function: read_objects Array[Map[String, String]] lanelet_infos = read_objects(per_sample_manifest_file)

I also tried using Array[struct]. I have to admit I have little experience with wdl structs, so I’m not sure if I used the correct syntax. However, I still get the same read_objects error.

$ miniwdl check ~/gitrepo/pipelines/releases/BatchImportShortReadAlignmentAndGenotyping/BatchImportShortReadAlignmentAndGenotyping_0.1.0.wdl (/Users/tn6/gitrepo/pipelines/releases/BatchImportShortReadAlignmentAndGenotyping/BatchImportShortReadAlignmentAndGenotyping_0.1.0.wdl Ln 16 Col 1) Failed to import ImportShortReadAlignment.wdl (ImportShortReadAlignment.wdl Ln 21 Col 1) Failed to import ImportShortReadLaneletAlignment.wdl (ImportShortReadLaneletAlignment.wdl Ln 37 Col 35) No such function: read_objects Array[Manifest] lanelet_infos = read_objects(per_sample_manifest_file) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (ImportShortReadLaneletAlignment.wdl Ln 40 Col 25) Not an array String irods_path = lanelet_infos[idx][LANELET_INFO_COLNAME_IRODS_PATH] ^^^^^^^^^^^^^^^^^^

Where Manifest struct is defined in Manifest.wdl:

version 1.0

struct Manifest { String sample_id String irods_path }

And imported by ImportShortReadLaneletAlignment.wdl:

import "Manifest.wdl"

Any suggestions on how to get past the “No such function: read_objects” error?

Many thanks! Thuy Nguyen Senior Bioinformatician Malaria Programme Wellcome Sanger Institute