pachterlab / seqspec

machine-readable file format for genomic library sequence and structure
MIT License

Getting "not unique across all regions" error although region_ids are unique #27

Closed visze closed 7 months ago

visze commented 1 year ago

I am getting the following errors on my file:

[error 1] IGVF_neuro_S1_R2_001.fastq.gz does not exist
[error 2] IGVF_neuro_S1_R1_001.fastq.gz does not exist
[error 3] IGVF_neuro_S1_R3_001.fastq.gz does not exist
[error 4] IGVF_neuro_S1_R2_001.fastq.gz does not exist
[error 5] IGVF_neuro_S1_R1_001.fastq.gz does not exist
[error 6] IGVF_neuro_S1_R3_001.fastq.gz does not exist
[error 7] IGVF_neuro_S1_R2_001.fastq.gz does not exist
[error 8] IGVF_neuro_S1_R1_001.fastq.gz does not exist
[error 9] IGVF_neuro_S1_R3_001.fastq.gz does not exist
[error 10] region_id 'IGVF_neuro_S1_R2_001.fastq.gz' is not unique across all regions
[error 11] region_id 'adapter_fwd' is not unique across all regions
[error 12] region_id 'IGVF_neuro_S1_R1_001.fastq.gz' is not unique across all regions
[error 13] region_id 'IGVF_neuro_S1_R3_001.fastq.gz' is not unique across all regions
[error 14] region_id 'adapter_rev' is not unique across all regions
[error 15] region_id 'IGVF_neuro_S1_R2_001.fastq.gz' is not unique across all regions
[error 16] region_id 'adapter_fwd' is not unique across all regions
[error 17] region_id 'IGVF_neuro_S1_R1_001.fastq.gz' is not unique across all regions
[error 18] region_id 'IGVF_neuro_S1_R3_001.fastq.gz' is not unique across all regions
[error 19] region_id 'adapter_rev' is not unique across all regions

I cannot explain errors 10 to 19, because the region_ids are unique.

Furthermore, errors 1 to 9 complain about missing files. But then the check should also mention Ngn2-RNA-1_S4_R1_001.fastq.gz, Ngn2-RNA-1_S4_R2_001.fastq.gz, Ngn2-RNA-1_S4_R3_001.fastq.gz, Ngn2-DNA-1_S1_R1_001.fastq.gz, Ngn2-DNA-1_S1_R2_001.fastq.gz and Ngn2-DNA-1_S1_R3_001.fastq.gz, because they are also not present.

My file:

!Assay
seqspec_version: 0.0.0
assay: "MPRA"
sequencer: "TODO"
name: mpra_shendure_assignment_80K
doi: ""
publication_date: ""
description: "Assignment library of the MPRA 80K design (cardiac, neuro and random CREs)"
modalities:
  - rna # FIXME to DNA
  - rna # FIXME to DNA
  - rna
lib_struct: ""
assay_spec:
  - !Region
    parent_id: null
    region_id: assignment
    region_type: gdna # FIXME to DNA
    name: Assignment
    sequence_type: random
    sequence: X
    min_len: 0
    max_len: 1024
    onlist: null
    regions:
      - !Region
        parent_id: assignment
        region_id: barcode
        region_type: barcode # or tag?
        name: Barcode
        sequence_type: random # can in theory be onlist, but this will be a long list with all possible combinations
        sequence: XXXXXXXXXXXXXXX
        min_len: 15
        max_len: 15
        onlist: null # or filename of all possible combinations
        regions:
          - !Region
            parent_id: barcode
            region_id: IGVF_neuro_S1_R2_001.fastq.gz
            region_type: fastq # or tag?
            name: IGVF_neuro_S1_R2_001.fastq.gz
            sequence_type: random # can in theory be onlist, but this will be a long list with all possible combinations
            sequence: XXXXXXXXXXXXXXX
            min_len: 15
            max_len: 15
            onlist: null # or filename of all possible combinations
            regions: null
      - !Region
        parent_id: assignment
        region_id: oligo
        region_type: gdna # FIXME to dna
        name: Oligo sequence
        sequence_type: onlist
        sequence: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
        min_len: 300
        max_len: 300
        onlist: !Onlist
          filename: /fast/groups/ag_kircher/work/MPRA/IGVF_Y1_design/final_design/results/final_design/design.fa.gz
          location: local
          md5: 5a34f80819cc26f33f641c9aad70be09
        regions:
          - !Region
            parent_id: oligo
            region_id: adapter_fwd
            region_type: linker # FIXME to adapter
            name: Forward adapter
            sequence_type: fixed
            sequence: AGGACCGGATCAACT
            min_len: 15
            max_len: 15
            onlist: null
            regions: null
          - !Region
            parent_id: oligo
            region_id: designed_sequence
            region_type: gdna # FIXME to dna
            name: Designed oligo sequence for testing
            sequence_type: onlist # or onlist because we know the design
            sequence: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
            min_len: 270
            max_len: 270
            onlist: !Onlist
              filename: /fast/groups/ag_kircher/work/MPRA/IGVF_Y1_design/final_design/results/final_design/design.fa.gz
              location: local
              md5: 5a34f80819cc26f33f641c9aad70be09
            regions:
              - !Region
                parent_id: designed_sequence
                region_id: IGVF_neuro_S1_R1_001.fastq.gz
                region_type: fastq
                name: IGVF_neuro_S1_R1_001.fastq.gz
                sequence_type: random
                sequence: X
                min_len: 1
                max_len: 146
                onlist: null
                regions: null
              - !Region
                parent_id: designed_sequence
                region_id: IGVF_neuro_S1_R3_001.fastq.gz
                region_type: fastq
                name: IGVF_neuro_S1_R3_001.fastq.gz
                sequence_type: random
                sequence: X
                min_len: 1
                max_len: 146
                onlist: null
                regions: null
          - !Region
            parent_id: assignment
            region_id: adapter_rev
            region_type: linker # FIXME to adapter
            name: Reverse adapter
            sequence_type: fixed
            sequence: CATTGCGTGAACCGA
            min_len: 15
            max_len: 15
            onlist: null
            regions: null
  - !Region
    parent_id: null
    region_id: dna_count_library
    region_type: cdna # or tag?
    name: DNA counts library
    sequence_type: random
    sequence: X
    min_len: 1
    max_len: 31
    onlist: null
    regions:
      - !Region
        parent_id: dna_count_library
        region_id: dna_counts
        region_type: barcode # or tag?
        name: DNA counts
        sequence_type: random
        sequence: XXXXXXXXXXXXXXX
        min_len: 15
        max_len: 15
        onlist: null
        regions:
          - !Region
            parent_id: dna_counts
            region_id: Ngn2-DNA-1_S1_R1_001.fastq.gz
            region_type: fastq
            name: Ngn2-DNA-1_S1_R1_001.fastq.gz
            sequence_type: random
            sequence: XXXXXXXXXXXXXXX
            min_len: 15
            max_len: 15
            onlist: null
            regions: null
          - !Region
            parent_id: dna_counts
            region_id: Ngn2-DNA-1_S1_R3_001.fastq.gz
            region_type: fastq # or tag or bc
            name: Ngn2-DNA-1_S1_R3_001.fastq.gz
            sequence_type: random
            sequence: XXXXXXXXXXXXXXX
            min_len: 15
            max_len: 15
            onlist: null
            regions: null
      - !Region
        parent_id: dna_count_library
        region_id: dna_umis
        region_type: umi
        name: DNA UMIs
        sequence_type: random
        sequence: XXXXXXXXXXXXXXXX
        min_len: 16
        max_len: 16
        onlist: null
        regions:
          - !Region
            parent_id: dna_counts
            region_id: Ngn2-DNA-1_S1_R2_001.fastq.gz
            region_type: fastq # or tag or bc
            name: Ngn2-DNA-1_S1_R2_001.fastq.gz
            sequence_type: random
            sequence: XXXXXXXXXXXXXXX
            min_len: 15
            max_len: 15
            onlist: null
            regions: null
  - !Region
    parent_id: null
    region_id: rna_count_library
    region_type: cdna # or tag?
    name: RNA counts library
    sequence_type: random
    sequence: X
    min_len: 1
    max_len: 31
    onlist: null
    regions:
      - !Region
        parent_id: rna_count_library
        region_id: rna_counts
        region_type: barcode # or tag?
        name: DNA counts
        sequence_type: random
        sequence: XXXXXXXXXXXXXXX
        min_len: 15
        max_len: 15
        onlist: null
        regions:
          - !Region
            parent_id: rna_counts
            region_id: Ngn2-RNA-1_S4_R1_001.fastq.gz
            region_type: fastq
            name: Ngn2-RNA-1_S4_R1_001.fastq.gz
            sequence_type: random
            sequence: XXXXXXXXXXXXXXX
            min_len: 15
            max_len: 15
            onlist: null
            regions: null
          - !Region
            parent_id: rna_counts
            region_id: Ngn2-RNA-1_S4_R3_001.fastq.gz
            region_type: fastq
            name: Ngn2-RNA-1_S4_R3_001.fastq.gz
            sequence_type: random
            sequence: XXXXXXXXXXXXXXX
            min_len: 15
            max_len: 15
            onlist: null
            regions: null
      - !Region
        parent_id: rna_count_library
        region_id: rna_umis
        region_type: umi
        name: DNA UMIs
        sequence_type: random
        sequence: XXXXXXXXXXXXXXXX
        min_len: 16
        max_len: 16
        onlist: null
        regions:
          - !Region
            parent_id: dna_counts
            region_id: Ngn2-RNA-1_S4_R2_001.fastq.gz
            region_type: fastq
            name: Ngn2-RNA-1_S4_R2_001.fastq.gz
            sequence_type: random
            sequence: XXXXXXXXXXXXXXXX
            min_len: 16
            max_len: 16
            onlist: null
            regions: null
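
The uniqueness claim can be checked independently of seqspec by walking the region tree. The following is a minimal sketch, not seqspec's own check: it assumes the spec has already been loaded into plain nested dicts, and the helper names `walk` and `audit` are hypothetical. The `spec` literal is an abbreviated stand-in for the file above. Besides counting region_ids, it compares each `parent_id` with the region that actually encloses it, since in the posted file `adapter_rev` is nested under `oligo` but declares `parent_id: assignment`.

```python
from collections import Counter


def walk(region, parent=None):
    """Yield (region, enclosing region_id) for a region and all its descendants."""
    yield region, parent
    # "regions" may be null in the YAML, hence the "or []".
    for child in region.get("regions") or []:
        yield from walk(child, region["region_id"])


def audit(assay_spec):
    """Return (duplicate region_ids, parent_id mismatches) for a list of top-level regions."""
    pairs = [pair for top in assay_spec for pair in walk(top)]
    counts = Counter(region["region_id"] for region, _ in pairs)
    duplicates = sorted(rid for rid, n in counts.items() if n > 1)
    mismatched = [
        (region["region_id"], region.get("parent_id"), parent)
        for region, parent in pairs
        if region.get("parent_id") != parent
    ]
    return duplicates, mismatched


# Abbreviated, hand-written stand-in for the assignment library above (assumption:
# the !Region tags have been converted to plain dicts).
spec = [
    {"region_id": "assignment", "parent_id": None, "regions": [
        {"region_id": "barcode", "parent_id": "assignment", "regions": None},
        {"region_id": "oligo", "parent_id": "assignment", "regions": [
            {"region_id": "adapter_fwd", "parent_id": "oligo", "regions": None},
            {"region_id": "adapter_rev", "parent_id": "assignment", "regions": None},
        ]},
    ]},
]

duplicates, mismatched = audit(spec)
print(duplicates)   # []
print(mismatched)   # [('adapter_rev', 'assignment', 'oligo')]
```

An empty `duplicates` list supports the observation that the region_ids themselves are unique. The `mismatched` list gives (region_id, declared parent_id, enclosing region_id) triples that may be worth correcting separately.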

sbooeshaghi commented 1 year ago

Please fix the spec by making the modalities unique, using the new version of the spec, which includes dna in its controlled vocabulary. Please let me know if you still encounter issues once you've done that.

sbooeshaghi commented 1 year ago

I've updated seqspec check to now report errors contained in your spec. In particular:

[error 1] modalities [rna, rna, rna] are not unique
[error 2] region_id 'assignment' of the first level of the spec does not correspond to a modality [rna, rna, rna]
[error 3] region_id 'dna_count_library' of the first level of the spec does not correspond to a modality [rna, rna, rna]
[error 4] region_id 'rna_count_library' of the first level of the spec does not correspond to a modality [rna, rna, rna]
...
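
The two rules behind these messages can be sketched in a few lines. This is an illustrative reimplementation, not the code used by seqspec check; the function name `check_modalities` is hypothetical, and it assumes that "correspond to a modality" means the top-level region_id equals one of the modality names.

```python
def check_modalities(modalities, top_level_region_ids):
    """Return error messages for non-unique modalities and unmatched top-level regions."""
    errors = []
    listing = ", ".join(modalities)
    # Rule 1: every modality may appear only once.
    if len(set(modalities)) != len(modalities):
        errors.append(f"modalities [{listing}] are not unique")
    # Rule 2 (assumed reading): each first-level region_id must equal a modality name.
    for rid in top_level_region_ids:
        if rid not in modalities:
            errors.append(
                f"region_id '{rid}' of the first level of the spec "
                f"does not correspond to a modality [{listing}]"
            )
    return errors


# Values taken from the spec posted in this issue.
errors = check_modalities(
    ["rna", "rna", "rna"],
    ["assignment", "dna_count_library", "rna_count_library"],
)
for i, message in enumerate(errors, start=1):
    print(f"[error {i}] {message}")
```

Under that reading, a spec whose modalities are distinct and whose first-level region_ids carry the same names, for example `check_modalities(["dna", "rna"], ["dna", "rna"])`, yields no errors.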

sbooeshaghi commented 10 months ago

Hi, can you verify for me if this worked for you?

sbooeshaghi commented 7 months ago

Going to close this due to inactivity. Please feel free to reopen if you continue to have issues.