malariagen / fits

File tracking system for group DK
0 stars 0 forks source link

MLWH/Subtrack-only import #62

Open magnusmanske opened 5 years ago

magnusmanske commented 5 years ago

As per suggestion I have written and run code to pull data only from MLWH and Subtrack, and not from iRODs/baton. See this code branch.

Information that is in the FITS production database that I can not find in MLWH or Subtrack (except data from Solaris, which I disregard here) is:

Igonring BAM/CRAM duplicate files, as well as "phix|human|.csv|.json|.vcf" files, the new database (which builds in hours rather than days) misses 3102 files that are in the FITS production database. They cluster as:

So I can, in principle at least, re-generate FITS from first principles. This also works well for updates (a few seconds). If you agree that this is the way forward, I will work on a mechanism to snapshot/import data from Solaris, in a documented fashion.

As a sidenote, a mapping from MLWH/Subtrack tables (or rather, query results, which is similar but not identical) in JSON is used for this.

alimanfoo commented 5 years ago

Many thanks Magnus, this sounds promising. Couple of comments inline...

As per suggestion I have written and run code to pull data only from MLWH and Subtrack, and not from iRODs/baton. See this code branch.

Information that is in the FITS production database that I can not find in MLWH or Subtrack (except data from Solaris, which I disregard here) is:

  • CRAM reference (file path)

This is file path to the reference genome, right? As we'll be remapping, this doesn't seem necessary.

  • alignment filter

Don't know what that is, so I guess we don't need it.

  • sequenscape library name

I think we need some kind of library identifier, but IIRC we're using mlwh_legacy_library_id so as long as we still have that then we're good.

Not sure if we desparately need any of those. If we do, or if you know where to find this data, please let me know. This resolves Issue 58.

Igonring BAM/CRAM duplicate files, as well as "phix|human|.csv|.json|.vcf" files, the new database (which builds in hours rather than days) misses 3102 files that are in the FITS production database. They cluster as:

  • Some human files owned by Richard Durbin (e.g. ILB_WGS_low_coverage_sequencing_of_Chagga_from_Tanzania); should we track those?
  • "Malaria Programme R&D" and "Team 112 R&D" (missing a flowcell ID in iseq_product_metrics, investigating)
  • "The Malaria Cell Atlas: a comprehensive reference of the Plasmodium life cycle using single-cell RNA-seq" owned by Mara; should we track those?
  • Loose change

So I can, in principle at least, re-generate FITS from first principles. This also works well for updates (a few seconds). If you agree that this is the way forward, I will work on a mechanism to snapshot/import data from Solaris, in a documented fashion.

FWIW that sounds good to me, although it would be great to sit with you at some point and talk through in more detail. In particular, it would be very helpful to understand the sequence in which data from different sources are imported, and what data are imported on initial build versus what gets imported during incremental updates. Don't feel you have to explain that here though, probably easier f2f.

podpearson commented 5 years ago

Many thanks Magnus, this sounds promising. Couple of comments inline...

As per suggestion I have written and run code to pull data only from MLWH and Subtrack, and not from iRODs/baton. See this code branch. Information that is in the FITS production database that I can not find in MLWH or Subtrack (except data from Solaris, which I disregard here) is:

  • CRAM reference (file path)

This is file path to the reference genome, right? As we'll be remapping, this doesn't seem necessary.

Agree this doesn't seem necessary

  • alignment filter

Don't know what that is, so I guess we don't need it.

Also don't know what this is, so I agree we don't need it.

  • sequenscape library name

I think we need some kind of library identifier, but IIRC we're using mlwh_legacy_library_id so as long as we still have that then we're good.

I think Jim used mlwh.iseq_flowcell.legacy_library_id for Pf6.1, however, the library IDs Jim used in the Pf 6.0 manifest looked more like mlwh.iseq_flowcell.id_library_lims. I'd be inclined to include both of these for now. @sclaugoncalves and I have an outstanding action in SIMS to create a diagram of data flows within Sanger core systems, so I'm hoping that at some point I might better understand what these two fields mean!

Not sure if we desparately need any of those. If we do, or if you know where to find this data, please let me know. This resolves Issue 58.

Actually, I don't think the above resolves #58, as that was concerned more about data files that are missing (e.g. 5528_5_human.bam), rather than fields that are missing. It is the stuff below that needs to be determined to resolve #58

Igonring BAM/CRAM duplicate files, as well as "phix|human|.csv|.json|.vcf" files, the new database (which builds in hours rather than days) misses 3102 files that are in the FITS production database. They cluster as:

  • Some human files owned by Richard Durbin (e.g. ILB_WGS_low_coverage_sequencing_of_Chagga_from_Tanzania); should we track those?

I would guess we don't need these, but could you follow up with @krockett?

  • "Malaria Programme R&D" and "Team 112 R&D" (missing a flowcell ID in iseq_product_metrics, investigating)

Once you've investigated, let me know if you'd like me to look at a list of these.

  • "The Malaria Cell Atlas: a comprehensive reference of the Plasmodium life cycle using single-cell RNA-seq" owned by Mara; should we track those?

I wouldn't have thought so. I think the scope of FITS should be "MalariaGEN files", and I don't think Mara's single-cell RNA-seq falls in this category. Would be good to know what criteria you used to query here. FWIW, I have previously included files for which mlwh.study.faculty_sponsor = "Dominic Kwiatkowski", though for the Pv4 build I did also pull in some files from one study owned by Julian Rayner, for which there was agreement that they could be used in MalariaGEN.

  • Loose change

Would be good to know what is included here

So I can, in principle at least, re-generate FITS from first principles. This also works well for updates (a few seconds). If you agree that this is the way forward, I will work on a mechanism to snapshot/import data from Solaris, in a documented fashion.

FWIW that sounds good to me, although it would be great to sit with you at some point and talk through in more detail. In particular, it would be very helpful to understand the sequence in which data from different sources are imported, and what data are imported on initial build versus what gets imported during incremental updates. Don't feel you have to explain that here though, probably easier f2f.

Yes, agree with all this. Perhaps the three of us could get together next time you are at Sanger @alimanfoo ?

magnusmanske commented 5 years ago
  • sequenscape library name

I think we need some kind of library identifier, but IIRC we're using mlwh_legacy_library_id so as long as we still have that then we're good.

I think Jim used mlwh.iseq_flowcell.legacy_library_id for Pf6.1, however, the library IDs Jim used in the Pf 6.0 manifest looked more like mlwh.iseq_flowcell.id_library_lims. I'd be inclined to include both of these for now. @sclaugoncalves and I have an outstanding action in SIMS to create a diagram of data flows within Sanger core systems, so I'm hoping that at some point I might better understand what these two fields mean!

That field is already tracked as "sequenscape library lims ID" in FITS. Possibly a tag duplication. I'll ignore it for now. We can always import it later.

podpearson commented 5 years ago

@magnusmanske , looking at your code, one difference between how you are pulling in data from mlwh and how I have previously done is that I think you are pulling in all studies FROM study WHERE faculty_sponsor like '%Kwiatkowski%', whereas I have only pulled in those where study.faculty_sponsor = "Dominic Kwiatkowski". I think this means you are pulling in the following two studies (faculty_sponsor = "Dominic Kwiatkowski/Matthew Berriman") in addition to the ones I have:

My guess is that neither of these would be considered to contain MalariaGEN samples. @sclaugoncalves , do you think we would want to track samples/files from either of these studies?

podpearson commented 5 years ago

Some human files owned by Richard Durbin (e.g. ILB_WGS_low_coverage_sequencing_of_Chagga_from_Tanzania); should we track those?

I would guess we don't need these, but could you follow up with @krockett?

Study ILB_WGS_low_coverage_sequencing_of_Chagga_from_Tanzania (id_study_lims = 2640), has "Dominic Kwiatkowski" as faculty_sponsor in mlwh. Where are you getting the information that these are owned by Richard Durbin? Is that from iRODS?

sclaugoncalves commented 5 years ago

@podpearson I will check with Mandy and let you know

sclaugoncalves commented 5 years ago

@podpearson regarding the R&D studies

1672 (Malaria R&D) - this study has samples from our group (Sam was using this study) 4472 (Malaria Programme R&D) - no samples from our team

podpearson commented 5 years ago

Thanks @sclaugoncalves, that's good to know . I think this brings up a fundamental question about the scope of FITS. Should it include all team112 samples, or only MalariaGEN samples? If the latter, do we have a clear definition of what is a MalariaGEN sample, and what isn't? For example, I'm assuming that the samples of Sam in 1672 are not considered MalariaGEN samples, but I could be wrong about this.

sclaugoncalves commented 5 years ago

What do you mean by a MalariaGEN sample? A sample that will be included in data releases? In that case, yes, the R&D studies are not MalariaGEN samples. If there is the need to release some R&D samples we usually move them across to a different study.

However if the scope of FITS is to track all data files I would think that should apply to R&D samples too.

magnusmanske commented 5 years ago

I had another go at getting files that are in "production FITS" but not "Subtrack FITS", excluding the usual subjects (bam/cram duplicates, vcf, csv etc.). I grouped them by study (from production FITS):

Malaria Programme R&D   1957
Plasmodium falciparum genome variation 1    1164
1168-PF-GH-AMENGA-ETEGO-SM  1149
The Malaria Cell Atlas: a comprehensive reference of the Plasmodium life cycle using single-cell RNA-seq    599
Anopheles gambiae genome variation 1    110
Team 112 R&D    109
Malaria Programme R&D - Assessment of data removed as host  76
Plasmodium falciparum Illumina sequencing R&D 1 60
Plasmodium HB3xDD2 progeny  36
Plasmodium vivax genome sequencing (Duffy)  24
Plasmodium 3D7xHB3 progeny  22
Plasmodium vivax genome variation 1 21
ILB_WGS_low_coverage_sequencing_of_Chagga_from_Tanzania 14
1231-VO-MULTI-WONDJI    6
IHTP_PWGS 1048-PV-SVSEA-PRICE-EMBARGOED 4
ILB_WGS_low_coverage_sequencing_of_Pare_from_Tanzania   4
IHTP_1130-AG-GAB-CAPUTO 4
Malaria R&D 2
AT-rich amplification   1

A lot of R&D which are not tracked like "proper" samples between MLWH and Subtrack. Some old files that are tracked in Subtrack as ".srf" rather than ".bam" (shudder). I'll try to find out what's going on with 1168-PF-GH-AMENGA-ETEGO-SM.

podpearson commented 5 years ago

@magnusmanske , by "Subtrack FITS", I assume you mean the version only importing from MLWH and Subtrack (not iRODS), is that right? If so, might be clearer to call this "non-iRODS FITS".

In the above, are you including files of human data, such as 5528_5_human.bam? If so, could you redo the analysis with all human files removed? See my previous comments at #56 .

podpearson commented 5 years ago

A lot of R&D which are not tracked like "proper" samples between MLWH and Subtrack.

I think in general R&D samples are not released to the ENA. As such, I wouldn't expect them to be in subtrack.

Some old files that are tracked in Subtrack as ".srf" rather than ".bam" (shudder).

I would strongly encourage you to try to understand how I have been doing this for Plasmodium builds. For example look at the third cell of https://github.com/malariagen/Pv4/blob/master/notebooks/rp7/20180326_Pv4_manifest.ipynb, in particular the following bit of code:


# Determine filename in subtrack (for very old samples, there is only a .srf file in subtrack)
# Note we determined the id_run cutoff (5750) for bam/srf after looking at previous runs. There seems
# to be a grey area between runs 5500 and 5750 where some files are srf and some are bam
df_return['subtrack_filename'] = df_return.index
df_return.loc[df_return['id_run'] < 5750, 'subtrack_filename'] = df_return.loc[df_return['id_run'] < 5750, 'subtrack_filename'].apply(lambda x: x.replace('.bam', '.srf'))
# Read in data from substrack, most importantly to get run accessions
sql_query = "\
SELECT \
    files.file_name as irods_filename, \
    files.bytes as subtrack_files_bytes, \
    files.timestamp as subtrack_files_timestamp, \
    files.public_date as subtrack_files_public_date, \
    submission.id as subtrack_submission_id, \
    submission.created as subtrack_submission_created, \
    submission.release_date as subtrack_submission_release_date, \
    submission.timestamp as subtrack_submission_timestamp, \
    submission.ext_db as subtrack_submission_ext_db, \
    submission.ebi_sample_acc, \
    submission.ebi_exp_acc, \
    submission.ebi_study_acc, \
    submission.ebi_sub_acc, \
    submission.ebi_run_acc \
FROM \
    submission, \
    files \
WHERE \
    files.sub_id = submission.id AND \
    files.file_name in (%s)\
" % ("'" + "', '".join(df_return['subtrack_filename']) + "'")


As you'll see in the above, for runs before 5750, I assume the file name in subtrack is .srf rather than .bam
podpearson commented 5 years ago

@sclaugoncalves

What do you mean by a MalariaGEN sample? A sample that will be included in data releases? In that case, yes, the R&D studies are not MalariaGEN samples. If there is the need to release some R&D samples we usually move them across to a different study.

I think by MalariaGEN sample I was meaning any sample collected by a MalariaGEN partner. For practical purposes I think this means any sample "owned" by Dominic, but not in an R&D study.

However if the scope of FITS is to track all data files I would think that should apply to R&D samples too.

I think we did previously say that the scope of FITS was to track all data files, so I guess we should stick with that, at least for now, and include R&D samples too. However, are we agreed that this should only include those from team112, not from other malaria teams? If so, it seems clear we should exclude sequencescape study 4472, right?

So, regarding sequencescape study 1672, does this also include samples from outside team112, or does this only have the samples from Sam? If there are samples from other teams, do we know which are from team112 and which are from other teams?

magnusmanske commented 5 years ago

@magnusmanske , by "Subtrack FITS", I assume you mean the version only importing from MLWH and Subtrack (not iRODS), is that right? If so, might be clearer to call this "non-iRODS FITS".

You are right, but I'm not entirely convinced your label is clearer...

In the above, are you including files of human data, such as 5528_5_human.bam? If so, could you redo the analysis with all human files removed? See my previous comments at #56 .

They are already excluded.

podpearson commented 5 years ago

@magnusmanske , by "Subtrack FITS", I assume you mean the version only importing from MLWH and Subtrack (not iRODS), is that right? If so, might be clearer to call this "non-iRODS FITS".

You are right, but I'm not entirely convinced your label is clearer...

Sorry, I wasn't very clear why I didn't think "Subtrack FITS" was a good name. Am I right in saying that the "Subtrack FITS" includes files from mlwh even if they are not in subtrack? But the opposite is not true, i.e. there are no samples from subtrack that are not in mlwh, right? If both these are true I think calling it "Subtrack FITS" is misleading as it might include samples that are not in subtrack. Perhaps "mlwh FITS" might be better.

magnusmanske commented 5 years ago

@magnusmanske , by "Subtrack FITS", I assume you mean the version only importing from MLWH and Subtrack (not iRODS), is that right? If so, might be clearer to call this "non-iRODS FITS".

You are right, but I'm not entirely convinced your label is clearer...

Sorry, I wasn't very clear why I didn't think "Subtrack FITS" was a good name. Am I right in saying that the "Subtrack FITS" includes files from mlwh even if they are not in subtrack? But the opposite is not true, i.e. there are no samples from subtrack that are not in mlwh, right? If both these are true I think calling it "Subtrack FITS" is misleading as it might include samples that are not in subtrack. Perhaps "mlwh FITS" might be better.

There are no files in MLWH. I wish there were.

podpearson commented 5 years ago

I appreciate that MLWH does not contain files as such, but it is certainly possible to determine a list of files from data in MLWH. Here is the function I wrote to do this which I have used (I think successfully) for recent Pf and Pv builds:

def set_irods_name(row):
    # First calculate prefix using nonhuman dependent on id_run and species (taxon)
    # Then calculate tag string (empty string if no tag)
    # Then calculate file extension
    # Then concat these three strings
    if (
        (row['taxon_id'] in [7165, 7173, 30066, 62324]) & # Anopheles gambiae, arabiensis, merus and funestus respectively
        (row['id_run'] <= 6750)
    ):
        prefix = "%d_%d_nonhuman" % (row['id_run'], row['position'])
    elif (
        ~(row['taxon_id'] in [7165, 7173, 30066, 62324]) & # Anopheles gambiae, arabiensis, merus and funestus respectively
        (row['id_run'] <= 7100)
    ):
        prefix = "%d_%d_nonhuman" % (row['id_run'], row['position'])
    else:
        prefix = "%d_%d" % (row['id_run'], row['position'])

    if np.isnan(row['tag_index']):
        tag_string = ''
    else:
        tag_string = '#%d' % row['tag_index']

    if row['id_run'] <= 10300:
        file_extension = '.bam'
    else:
        file_extension = '.cram'

    irods_filename = prefix + tag_string + file_extension
    return(irods_filename)

I was asking about files that might be accessible via mlwh but not subtrack as I had previously been assuming that there are files that can be identified through mlwh queries that can not be found in subtrack. I was surprised and delighted (simple things please simple minds) to find that it turns out that this assumption is not true (see #63), and that I had previously been assuming it was true because in some cases I was assuming the file in subtrack would have a .srf extension when in fact the file had a .bam extension. After modifying my subtrack query to match using only the file basename (i.e. ignoring the file extension), all files found using my mlwh query could also be found in subtrack (at least this is true for the 18,310 Pf files currently found by my query).

FWIW, I have changed my subtrack query from the one above to the following:

    # Determine basename (i.e. filename without extension) in subtrack
    # We use the basename because for many older samples, the file has a .srf extension in subtrack
    # but this is not done consistently, e.g. some very old samples such as 245_7_nonhuman have a .bam extension
    df_return['subtrack_basename'] = df_return.index.to_series().apply(lambda x: x.split('.')[0])

    # Read in data from substrack, most importantly to get run accessions
    sql_query = "\
    SELECT \
        files.sub_id as subtrack_files_sub_id, \
        files.file_name as subtrack_files_file_name, \
        SUBSTRING_INDEX(files.file_name, '.', 1) as subtrack_basename, \
        files.bytes as subtrack_files_bytes, \
        files.timestamp as subtrack_files_timestamp, \
        files.public_date as subtrack_files_public_date, \
        submission.id as subtrack_submission_id, \
        submission.created as subtrack_submission_created, \
        submission.release_date as subtrack_submission_release_date, \
        submission.timestamp as subtrack_submission_timestamp, \
        submission.ext_db as subtrack_submission_ext_db, \
        submission.ebi_sample_acc, \
        submission.ebi_exp_acc, \
        submission.ebi_study_acc, \
        submission.ebi_sub_acc, \
        submission.ebi_run_acc \
    FROM \
        submission, \
        files \
    WHERE \
        files.sub_id = submission.id AND \
        SUBSTRING_INDEX(files.file_name, '.', 1) in (%s)\
    " % ("'" + "', '".join(df_return['subtrack_basename']) + "'")

In case useful for comparison with FITS, I have included the output of my latest mlwh/subtrack query in #63.

krockett commented 5 years ago

Some human files owned by Richard Durbin (e.g. ILB_WGS_low_coverage_sequencing_of_Chagga_from_Tanzania); should we track those?

I would guess we don't need these, but could you follow up with @krockett?

Study ILB_WGS_low_coverage_sequencing_of_Chagga_from_Tanzania (id_study_lims = 2640), has "Dominic Kwiatkowski" as faculty_sponsor in mlwh. Where are you getting the information that these are owned by Richard Durbin? Is that from iRODS?

Hi All @podpearson, @sclaugoncalves, @magnusmanske ,

I have checked the list of samples in question here (Magnus sent me a list of 80). All are MalariaGEN samples from one of our Tanzanian trios collections. Richard Durbin should not be associated with these. Will need to ensure RD ownership has not meant any release of the data.

podpearson commented 5 years ago

Here is an attempt to summarise outstanding questions/actions from the above:

  1. @magnusmanske has stated that some human files owned by Richard Durbin (e.g. ILB_WGS_low_coverage_sequencing_of_Chagga_from_Tanzania). Where is this mapping from files to Richard Durbin coming from, iRODS?
  2. Is Magnus investigation of the missing a flowcell ID in iseq_product_metrics complete? If so, what did this show? Or is this now irrelevant?
  3. Are we agreed that FITS should only include files from team112, not from other malaria teams?
  4. Does sequencescape study 1672 include samples from outside team112?
  5. How do we ensure that we only include files from team112? If we restrict to studies where faculty_sponsor = 'Dominic Kwiatkowski' we will miss files in sequencescape study 1672, so need to know if we should include all of 1672, or just a subset.
  6. Once the above are answered we will need a new version of the MLWH/Subtrack-only import, before we can move on to https://github.com/malariagen/fits/issues/58. Note that as well as limiting the rows, we should also ensure that both mlwh.iseq_flowcell.legacy_library_id and mlwh.iseq_flowcell.id_library_lims are included.
  7. @magnusmanske , @alimanfoo and @podpearson need to get together at Sanger to understand the sequence in which data from different sources are imported, and what data are imported on initial build versus what gets imported during incremental updates.

I would suggest we either attempt to answer the above in this issue, or else create new issues.