qmarcou / IGoR

IGoR is a C++ software designed to infer V(D)J recombination related processes from sequencing data. Find full documentation at:
https://qmarcou.github.io/IGoR/
GNU General Public License v3.0

read-seqs input parameter improvement #36

Open penuts7644 opened 5 years ago

penuts7644 commented 5 years ago

Hi Quentin,

According to the documentation for the -read_seqs parameter, the input CSV file should be formatted with the sequence index as the first column and the sequence as the second, separated by a semicolon ';'.

I would expect to be able to pass in a CSV file with multiple semicolon-separated columns and have IGoR use only the first two. However, each line is split only at the first semicolon found in that line, so the second column is merged with all remaining columns.

Example:

The line index;sequence;other_data is parsed as index in the first column and sequence;other_data in the second column. I would expect index as the first column and sequence as the second column.
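The difference between the two behaviours can be sketched as follows. This is only an illustration in Python, not IGoR's actual C++ parsing code, and both function names are hypothetical.

```python
def split_on_first_semicolon(line):
    """Mimic the reported behaviour: split only at the first ';'."""
    index, _, rest = line.rstrip("\n").partition(";")
    return index, rest


def first_two_columns(line):
    """Expected behaviour: split on every ';' and keep the first two fields."""
    fields = line.rstrip("\n").split(";")
    if len(fields) < 2:
        raise ValueError(f"expected at least two columns, got: {line!r}")
    return fields[0], fields[1]


line = "index;sequence;other_data"
print(split_on_first_semicolon(line))  # ('index', 'sequence;other_data')
print(first_two_columns(line))         # ('index', 'sequence')
```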

Is there a reason for this behaviour?

Cheers, Wout

qmarcou commented 5 years ago

Hi @penuts7644,

Nope, there is no good reason other than this: by assuming the CSV has only two columns, the user cannot make mistakes in the column ordering. I agree this is not very handy, and I will try to make a change to allow a slightly more flexible format.

Best, Quentin
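Until a more flexible format is available, one possible workaround is to trim the file to its first two columns before handing it to IGoR. The sketch below is a hypothetical helper that is not part of IGoR; it assumes the index and sequence are already the first two semicolon-separated fields of every row.

```python
import csv


def keep_first_two_columns(src_path, dst_path, delimiter=";"):
    """Copy src_path to dst_path, keeping only the first two fields of each row."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter=delimiter)
        writer = csv.writer(dst, delimiter=delimiter, lineterminator="\n")
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or malformed lines
            writer.writerow(row[:2])
```

The trimmed file can then be passed to -read_seqs in place of the original.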

decenwang commented 5 years ago

Hi All, @qmarcou @penuts7644

  1. Another question. When I pass the sequences in FASTA format with the 'igor -read_seqs' command line, I do not assign an index to the sample. In the /tmp file, I found that each sequence was automatically prefixed with a number and a semicolon, e.g. 0; 1; 2; ... By definition these numbers are sequence indices, not the DNA index/barcode. Since all the sequences come from the same sample, do I really need to assign an index for each sample (that is, for all the sequences of each sample)? In any case, I hope IGoR could recognize the index by itself if I could provide an index file before inputting the sequences, for either a single index or dual indices.
  2. With paired-end (PE) sequencing, fastq-dump splits the reads and Trimmomatic trims them, so we end up with separate read 1 and read 2 files for each sample. Should both read 1 and read 2 of a sample be analyzed, or should I analyze only read 1 or only read 2?
  3. Could you please add a plugin or feature that translates DNA into peptide sequences? TCR chains are special: they need the help of MHC I/II, namely the anchor residues. Particular amino acids (e.g. Arg, Glu) may differ in number among different cohorts.
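On the first point, the automatically added numbers are per-sequence indices, so a sample label has to be tracked separately. A minimal sketch of that bookkeeping is shown below; it assumes plain FASTA input, uses a hypothetical helper name, and does not reproduce IGoR's own FASTA reader. Each record keeps its FASTA header next to the sequential index, so sequences can be traced back to their sample afterwards.

```python
def index_fasta(lines):
    """Assign sequential indices (0, 1, 2, ...) to FASTA records.

    Returns a list of (index, header, sequence) tuples.
    """
    records = []
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((len(records), header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((len(records), header, "".join(chunks)))
    return records


# Hypothetical example input; the header names are made up.
fasta = """>sampleA_read1
ACGT
TTGA
>sampleA_read2
GGCC
""".splitlines()

for idx, name, seq in index_fasta(fasta):
    print(f"{idx};{seq}")  # two-column row: index;sequence
```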

Thanks a million!